ON THE INTEGRATED SCHEDULING OF HARDKILL AND
SOFTKILL ASSETS USING DYNAMIC PROGRAMMING

1  INTRODUCTION 

In  this  report  we  investigate  the  Anti-Ship  Cruise  Missile  (ASCM)  point  defense  problem. 
Our focus is the integrated employment of defensive systems that have a potential for harmful
interference. Because most shipboard defensive systems exploit the electromagnetic spectrum, the
potential  for  interference  exists  between  many  of  these  systems.  Often  the  adverse  effects  of  this 
interference  can  be  eliminated  during  system  design  or  through  retrofit  programs.  When  this  is  not 
practical,  simple  and  effective  policies  have  been  developed  for  cases  in  which  system  effectiveness 
is restricted to disjoint range bands [1]. The case of overlapping effective regions is considerably
more  complex  and  less  well  understood,  and  it  is  the  focus  of  this  study. 

Modern ASCMs rely on a combination of high speed, low altitude, and internal guidance to
reduce the period between ASCM detection and ASCM impact to under one minute. The concept
of defense in depth, illustrated in Fig. 1, has been applied to the design of defensive systems to
increase  survival  probability.  Multiple  systems  that  use  different  techniques  and  are  effective  in 
different  zones  are  used  to  minimize  the  probability  that  the  ASCM  will  reach  the  ship.  Outer  zone 
systems  operate  beyond  the  range  of  on  board  sensors  and  are  provided  by  other  platforms.  Point 
defense  systems  are  restricted  to  the  middle  and  inner  zones.  Middle  zone  systems  are  typically 
effective  at  ranges  from  the  horizon  down  to  a  few  kilometers,  and  inner  zone  systems  typically 
operate  within  a  few  kilometers. 

Systems that work by disrupting ASCM guidance, known as softkill systems, are most effective
in the middle zone. This is the same zone in which medium range “hardkill” systems such as
Surface-to-Air Missiles (SAMs) are employed. Softkill systems are designed to disrupt ASCM guidance
through  electromagnetic  effects,  but  sometimes  it  is  not  possible  to  eliminate  the  interaction  between 
a  softkill  system  and  other  defensive  systems.  To  better  understand  the  effect  of  this  interaction 
on  the  effectiveness  of  defensive  system  employment  policies,  we  will  investigate  the  interaction 
between one softkill system, chaff, and one middle zone hardkill system, a SAM system. We chose
a  SAM  system  because  other  hardkill  systems  have  demonstrated  more  limited  effectiveness  in  the 
middle  zone.  Our  choice  of  chaff  was  motivated  by  its  widespread  availability  and  by  the  difficulty 
of  finding  a  SAM  firing  schedule  that  does  not  unduly  reduce  its  effectiveness. 

We begin by developing system effectiveness models for the chaff and SAM systems against
a single ASCM. These single-ASCM models provide the basis for our development of a
multiple-ASCM engagement model that we use to compute the probability of surviving an attack for a given
defensive system employment policy. We then investigate algorithms to find optimal policies. Two
approaches to the development of an optimal policy, exhaustive search and dynamic programming,
are  presented. 

2  SYSTEM  EFFECTIVENESS  MODELS

To understand the interaction between the SAM and chaff systems, it is necessary to understand
the operation of each in some detail. Since our goal is to understand the consequences of
harmful interference, we will introduce only those details necessary to quantify the effects of that
interference. Furthermore, where the interference can be eliminated by tactics that do not reduce
system effectiveness, we will adopt those tactics to focus our attention on a single significant
interaction. While practical application of this research will require more detailed models, our restricted
focus  should  be  sufficient  to  develop  an  understanding  of  the  effect  of  such  interactions  on  the 
effectiveness  of  defensive  system  employment  policies. 

2.1  ASCM  Guidance 

Since the purpose of the chaff is to deceive the ASCM’s guidance system, we begin our discussion
with a brief review of ASCM guidance. Modern ASCMs use a variety of guidance techniques, and
some  use  multiple  sensors.  At  ranges  in  the  middle  zone,  radar  is  the  most  common  sensor  used 
for  ASCM  guidance.  Accordingly,  we  model  a  radar  guided  ASCM. 

A radar seeker operates by radiating microwave energy in the direction of the target and processing
the reflected signal to determine the target’s location. The seeker that we model is a low-resolution,
monopulse seeker that uses leading edge tracking. This type of seeker processes signals returned
from  a  relatively  small  region  in  range,  which  is  known  as  the  “range  gate.”  The  attacker  programs 
the  initial  size  and  location  of  the  range  gate  to  ensure  that  the  ship  is  contained  within  it.  After 
detecting  the  ship,  the  seeker  periodically  measures  the  energy  received  in  each  part  of  the  range 
gate  and  then  updates  the  position  of  the  range  gate  so  that  the  target  remains  within  it.  Leading 
edge  tracking  is  a  simple  countermeasure  against  pulse  delay  techniques  that  might  be  used  by  a 
defender.  However,  a  leading  edge  tracking  seeker  is  susceptible  to  deception  by  chaff.  To  implement 
leading  edge  tracking,  the  ASCM  seeker  attempts  to  center  the  range  gate  on  the  nearest  edge  of 
the  target  by  biasing  the  tracker  to  place  the  leading  edge  of  the  signal  near  the  center  of  the  range 
gate. 




2.2  Chaff  System 

Chaff  rounds  are  deployed  ballistically  from  the  ship  by  using  a  launcher  with  a  fixed  position 
and orientation. After a preprogrammed delay, the chaff round blooms into a large cloud of
conducting strips that float slowly to Earth. The strips are designed to efficiently reradiate incident
electromagnetic  energy  in  the  range  of  frequencies  used  by  the  ASCM  seeker.  Because  the  range 
gate  initially  includes  the  ship,  we  must  select  a  launcher  orientation  and  a  bloom  delay  that  places 
the  chaff  cloud  near  the  ship  when  it  blooms,  if  we  hope  to  deceive  the  ASCM  seeker.  We  can  then 
increase the range separation (along the ASCM-ship axis) by moving the ship so that the ASCM
seeker must eventually choose between the two. If the seeker chooses the chaff cloud, and we also
increase  the  cross-range  separation  between  the  chaff  cloud  and  the  ship  sufficiently,  the  ASCM 
will  miss  the  ship.  We  call  this  a  successful  “seduction.”  Unfortunately,  the  success  of  a  seduction 
attempt is difficult to determine while the ASCM is in the middle zone, because there is no direct
way  for  the  defender  to  measure  the  position  of  the  ASCM’s  range  gate,  and  the  trajectory  change 
caused  by  a  successful  seduction  is  very  small  at  those  ranges. 

The  defender  can  exploit  the  leading  edge  bias  of  the  ASCM  seeker  by  initially  placing  the  chaff 
cloud  between  the  ship  and  the  ASCM  and  choosing  a  ship  velocity  vector  that  simultaneously 
increases both the range and cross-range separation between the two objects. Figure 2 shows this
geometry with a reference frame centered on the moving ship. The time needed to establish
the required cross-range separation sets the minimum range at which a seduction can be
effective. We call this time $\tau_I$, the time spent by the ASCM in the inner zone where chaff is
ineffective. Similarly, we refer to the time from ASCM detection until the ASCM enters the inner
zone as $\tau_M$, the time spent by the ASCM in the middle zone.


Fig.  2  —  Chaff  seduction  geometry 


When more than one object is in the range gate, the behavior of the ASCM seeker is based
on the combined signal return from all of the objects. The signal returned by the ship has been
observed to undergo large amplitude fluctuations as ship motion and multipath effects combine
to produce constructive and destructive interference in the signal returned by a small number of
dominant scatterers in different locations on the ship. The observed amplitude fluctuations in the
signal returned by the chaff cloud are much smaller because the chaff cloud is made up of a large
number of small scatterers that experience more consistent motion.

Instead  of  presenting  a  choice  of  range  gate  positions,  we  could  force  the  ASCM  to  choose 
between  two  objects  that  remain  within  the  range  gate  but  separate  in  bearing.  Establishing 
the required bearing separation requires either a very large cross-range separation or a very close
approach  by  the  ASCM.  The  required  cross-range  separation  is  difficult  to  achieve  while  the  ASCM 
is  in  the  middle  zone,  because  the  available  time  and  ship  speed  are  limited.  For  this  reason 
“bearing  seduction”  is  an  inner  zone  phenomenon.  Hence,  we  will  restrict  our  attention  to  range 
gate  seduction,  which  we  henceforth  refer  to  simply  as  seduction. 

To develop an analytic model of seduction effectiveness, we performed extensive computer
simulation of typical seduction scenarios. The C-based Routines for Understanding the Interaction
between  ships,  electronic  warfare  and  missiles  (CRUISE  Missiles)  simulation  developed  by  the 
Naval  Research  Laboratory  was  used.  A  ship  model  for  a  destroyer  was  used  in  conjunction  with  a 
chaff model for super rapid blooming offboard chaff and an ASCM seeker model for a subsonic
sea-skimming radar-guided missile using a monopulse seeker and a low-resolution leading edge range
tracker.  The  relatively  low  sea  state  (0.5  meter  root  mean  square  wave  height)  we  chose  increased 
sea  surface  reflections  and  thereby  created  significant  multipath  fading. 

Figure  3  shows  the  position  of  the  inner  and  outer  edges  of  the  ASCM’s  range  gate  as  the 
ship-chaff  separation  increases  during  a  typical  simulation  run.  The  range  extents  of  the  ship  and 
the  chaff  cloud  are  plotted  for  reference.  In  repeated  simulations,  the  ASCM  seeker  seemed  to  show 
a  marked  preference  for  tracking  either  the  ship  or  the  chaff,  even  when  both  were  in  the  range 
gate.  The  range  gate  was  observed  to  shift  from  the  ship  to  the  chaff  cloud  when  the  signal  return 
from  the  ship  underwent  a  deep  fade,  and  to  shift  back  when  the  signal  return  from  the  ship  again 
dominated  that  of  the  chaff  cloud.  Once  the  separation  became  so  large  that  only  one  object  was 
in the range gate, no further transitions were observed.


Fig.  3  —  Range  gate  position 


To  understand  this  behavior  we  must  discuss  radar  seeker  design  in  more  detail.  The  optimum 
ratio of peak signal to mean noise is obtained when the ASCM’s radar receiver uses a filter that
is  matched  to  the  transmitted  pulse.  For  the  rectangular  transmitted  pulse  typically  used  by  a 
low-resolution  seeker,  a  single  point  scatterer  produces  an  output  signal  from  the  matched  filter 
that  initially  increases  linearly  with  time  for  the  duration  of  the  transmitted  pulse  and  then  linearly 
decreases  with  time  for  the  same  period.  Objects  with  range  extent  will  result  in  an  output  that 
is the magnitude of the coherent sum of several such triangles, each with a potentially different
amplitude, phase, and time delay. Figure 4 is a conceptual diagram that illustrates the time
relationship between the output of a matched filter for the chaff and the ship, as they separate in
range.
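
As a hedged illustration of the matched filter output just described, the response can be written with a unit triangle function. The symbols $T$ (transmitted pulse duration), $a_k$, $\phi_k$, and $\tau_k$ (amplitude, phase, and time delay of the $k$-th scatterer) and the function $\Lambda$ are notation introduced here for illustration; they do not appear in the report, and overall amplitude scaling is ignored.

```latex
\Lambda(x) = \max\bigl(0,\; 1 - |x|\bigr), \qquad
y(t) = \left|\, \sum_{k} a_k \, e^{j\phi_k} \,
       \Lambda\!\left( \frac{t - \tau_k - T}{T} \right) \right|
```

For a single scatterer the output rises linearly from zero at $t = \tau_k$ to its peak at $t = \tau_k + T$ and falls back to zero at $t = \tau_k + 2T$, which is the triangle described in the text.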


[Figure 4 labels: Chaff-to-Ship and Ship-to-Chaff transition regions; Range Gate on Chaff; Range Gate on Ship; axis: Separation]

Fig.  4  —  Range  gate  behavior 


To capture all of the matched filter output associated with a single scatterer, the range extent of
the range gate is chosen to correspond to twice the duration of the transmitted pulse. To implement
the leading edge tracking feature, we integrate the output of the matched filter over the range gate,
assigning twice as much weight to the output in the first half of the gate as we do to the output
of the filter in the second half of the gate.
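
The weighting rule above can be sketched numerically. This is a minimal sketch, not the seeker model used in the report: the function names, the midpoint integration, and the use of real-valued amplitudes (phase is ignored) are all assumptions made for illustration.

```python
def triangle(t, delay, pulse):
    """Matched filter output for a unit point scatterer.

    Rises linearly for one pulse duration starting at `delay`,
    then falls linearly for one more pulse duration.
    """
    x = (t - delay - pulse) / pulse
    return max(0.0, 1.0 - abs(x))


def weighted_gate_energy(gate_start, scatterers, pulse, steps=2000):
    """Integrate the filter output over a gate two pulse durations wide.

    The first half of the gate is weighted twice as heavily as the
    second half. `scatterers` is a list of (amplitude, delay) pairs;
    phase is ignored in this sketch (an assumption).
    """
    width = 2.0 * pulse
    dt = width / steps
    total = 0.0
    for i in range(steps):
        t = gate_start + (i + 0.5) * dt  # midpoint of the i-th slice
        output = abs(sum(a * triangle(t, d, pulse) for a, d in scatterers))
        weight = 2.0 if i < steps // 2 else 1.0
        total += weight * output * dt
    return total


if __name__ == "__main__":
    # Gate aligned with the echo versus a gate that starts one pulse early.
    print(weighted_gate_energy(0.0, [(1.0, 0.0)], 1.0))
    print(weighted_gate_energy(-1.0, [(1.0, 0.0)], 1.0))
```

Comparing the two calls shows how echo energy that falls in the first half of the gate counts double, which is the source of the leading edge bias.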

The position that the range gate would assume, if only the chaff or only the ship were present
and if the signal return from each object were uniformly distributed in range, is shown for reference
in Fig. 4. The range gate positions actually shift slightly as the relative predominance of the
scatterers that make up the two objects changes, but such shifts are small compared with the size
of  the  range  gate.  When  the  signal  returned  by  the  ship  fades  severely,  the  range  gate  moves  to 
the  chaff-based  position,  if  enough  of  the  matched  filter  output  resulting  from  the  chaff  is  still  in 
the  range  gate  at  the  time  of  the  fade. 

As the range separation between the chaff and the ship increases, the ability of the range gate
to move between them becomes more constrained. When they are superimposed in range (at the
far left in Fig. 4), the range gate can move easily between positions based on a dominant return
from  the  chaff  or  the  ship.  Eventually  a  separation  is  reached  beyond  which  the  leading  edge  bias  of 
the  tracker  precludes  chaff-to-ship  transitions.  Ship-to-chaff  transitions  remain  possible,  however. 
Finally,  a  sufficient  separation  is  reached  to  prevent  any  transitions.  If  the  range  gate  is  tracking  the 
chaff when this separation is reached, the seduction will be successful. If the ship is being tracked,
the  seduction  will  be  unsuccessful.  It  is  this  one-way  transition  period  that  makes  a  leading  edge 
tracker  particularly  susceptible  to  seduction  by  chaff  for  some  geometries. 

If  more  than  one  chaff  cloud  is  present  simultaneously,  the  same  analysis  can  be  applied  to  the 
motion  of  the  range  gate  from  one  chaff  cloud  to  another.  Superimposed  chaff  clouds  act  like  a 
single  chaff  cloud  that  is  more  dense,  while  closely  spaced  chaff  clouds  act  like  a  single  chaff  cloud 
with  a  larger  range  extent.  Chaff  clouds  with  a  sufficient  range  separation  operate  independently, 
and  only  the  one  closest  to  the  ship  is  capable  of  causing  a  seduction. 

2.3  SAM  System 

Like  an  ASCM,  the  SAM  that  we  model  uses  reflected  microwave  radiation  for  guidance.  Instead 
of placing the radar illuminator in the SAM, however, a large shipboard microwave illuminator,
known as a “SAM director,” is used to illuminate the relatively small ASCM. The SAM is designed
to fly a standard trajectory from launch until guidance information is received. This trajectory
prevents the use of the SAM system in the inner zone. Although the minimum range at which
the SAM system can be used is not necessarily the same as the boundary between the inner and
middle zones we defined for the chaff, the difference is slight, and we consider them to be the same
in order to avoid introducing additional notation.

The  surveillance  radar  equipment  installed  on  modern  warships  is  able  to  detect  an  ASCM  that 
is above the horizon when tactical considerations allow its use. Against a sea-skimming ASCM,
surveillance radar equipment has an initial detection range of approximately 20 km. Midcourse
guidance can be provided by a separate command channel or by homing on the reflected signal
energy from the SAM illuminator. Regardless of which technique is used for midcourse guidance,
the ASCM must be illuminated for several seconds immediately before the SAM reaches it to
facilitate  terminal  guidance. 

We  have  chosen  to  model  a  SAM  system  that  uses  reflected  signal  energy  for  midcourse  guidance, 
because  the  requirement  for  continuous  illumination  increases  the  potential  for  harmful  interference 
from  the  SAM  system.  SAM  director  illumination  is  also  limited  by  the  horizon  to  about  20  km 
against  a  sea-skimming  ASCM.  Because  the  ASCM  must  be  illuminated  for  SAM  guidance,  it  is  not 
sensible  to  launch  a  SAM  until  the  ASCM  is  detected.  Once  an  ASCM  is  detected,  however,  a  SAM 
can  be  launched  immediately.  Because  the  SAM  director  and  the  SAM  do  not  share  a  common 
time  base,  a  SAM  using  reflected  signal  energy  for  midcourse  guidance  has  no  way  to  determine 
the  distance  to  the  ASCM.  Therefore,  each  SAM  guides  towards  all  ASCMs  in  the  middle  zone  on 
the  bearing  illuminated  by  the  SAM  director.  We  assume  here,  for  the  sake  of  simplicity,  that  each 
SAM  tracks  towards  the  closest  ASCM. 

Because  SAMs  employ  a  proximity  fuse,  the  SAM  is  destroyed  when  it  reaches  the  closest  ASCM, 
regardless  of  the  fate  of  the  ASCM.  The  defender  on  the  ship  can  rapidly  determine  whether  an 
ASCM  has  been  destroyed  by  observing  the  reflected  SAM  director  signal. 

2.4  Chaff  Effectiveness  Model 

The  SAM  system  can  significantly  influence  the  probability  of  a  successful  seduction.  This 
results  from  domination  of  the  signal  returned  from  the  chaff  by  the  signal  returned  from  the  ship 
when  the  SAM  director  is  oriented  towards  the  ASCM.  The  effect  depends  on  the  relative  strength 
of  the  signal  returns  from  the  ship  itself  and  from  the  chaff,  the  design  of  the  SAM  director  antenna, 
and  the  design  of  the  ASCM  receiver. 

Because  we  wish  to  study  the  effect  of  this  interaction  rather  than  its  cause,  we  model  the 
interaction  by  introducing  an  overwhelmingly  dominant  point  scatterer  that  is  present  only  during 
SAM  director  illumination.  We  have  chosen  a  point  scatterer  with  such  a  large  radar  cross  section 
that  even  during  the  deepest  fade  it  will  dominate  the  signal  return  from  the  chaff. 

Figure  5  shows  the  effect  of  introducing  the  SAM  director  into  our  analysis  of  range  gate 
behavior.  In  that  figure  we  have  separately  depicted  the  matched  filter  output  associated  with  the 
SAM director by using dashed lines. As before, when the chaff and the ship are superimposed in
range,  the  range  gate  can  move  easily  between  positions  based  on  a  dominant  return  from  the  chaff, 
the  ship,  or  the  SAM  director.  Eventually  separation  A  is  reached,  beyond  which  the  leading  edge 
bias  of  the  tracker  precludes  chaff-to-ship  transitions  when  the  SAM  director  is  not  illuminating  the 
ASCM.  Ship-to-chaff  transitions  (and  SAM  director-to-chaff  transitions)  remain  possible,  however, 
and  SAM  director  illumination  could  still  result  in  a  chaff-to-SAM  director  transition.  Subsequently, 
separation  B  is  reached,  at  which  the  matched  filter  output  from  the  SAM  director  signal  is  outside 
the  range  gate  whenever  the  range  gate’s  position  is  based  on  the  chaff.  Beyond  separation  B 
chaff-to-SAM director transitions become impossible, although the leading-edge bias of the tracker
still permits ship-to-chaff (and SAM director-to-chaff) transitions. Finally, separation C is reached.
Separation C is sufficient to prevent any transitions except the slight change in position associated
with a ship-to-SAM director transition or a SAM director-to-ship transition.

Fig. 5 — Range gate behavior with a SAM director

To  achieve  a  successful  seduction,  the  range  gate  must  be  tracking  the  chaff  when  this  final 
separation  is  reached.  The  probability  of  this  occurrence  depends  on  the  signal  returns  from  the 
chaff,  the  ship,  and  the  SAM  director  during  the  seduction  attempt.  We  describe  in  detail  the 
case  in  which  the  range  gate  is  initially  tracking  the  ship  in  order  to  develop  an  expression  for  the 
probability  of  a  successful  seduction,  given  that  no  previous  seduction  attempt  was  successful. 

As  Fig.  3  shows,  at  each  instant  in  time  the  range  gate  is  captured  by  the  object  within  it 
that  is  producing  the  strongest  weighted  output  from  the  matched  filter.  When  the  SAM  director 
is  not  illuminating  the  ASCM,  the  ship’s  signal  return  usually  dominates  that  of  the  chaff,  even 
when  the  chaff  signal  is  weighted  more  heavily.  Thus,  when  both  objects  are  in  the  range  gate,  the 
range  gate  is  normally  tracking  the  ship.  Occasional  fades  by  the  ship  reverse  this  predominance, 
however,  and  lead  to  a  temporary  capture  of  the  range  gate  by  the  chaff. 

At  relatively  large  separations,  a  range  gate  capture  by  the  chaff  places  most  of  the  ship  outside 
the  range  gate.  When  this  occurs,  recapture  of  the  range  gate  by  the  ship  is  precluded  as  long  as  the 
SAM  director  remains  quiescent.  If  the  range  gate  still  includes  the  position  of  the  SAM  director, 
however,  the  extremely  strong  signal  from  the  SAM  director  captures  the  range  gate  whenever  the 
SAM  director  illuminates  the  ASCM.  Once  the  SAM  director  moves  out  of  the  range  gate,  SAM 
director  illumination  no  longer  affects  the  range  gate’s  position.  So,  if  the  range  gate  is  tracking 
the  chaff  when  the  separation  increases  to  the  point  where  the  SAM  director  is  outside  the  range 
gate,  a  successful  seduction  is  assured. 

Reference  to  Fig.  5  allows  us  to  construct  a  seduction  model  based  on  this  behavior.  We  have 
assumed  that  the  range  gate  is  initially  placed  over  the  ship  by  the  platform  that  launches  the 
ASCM.  To  ensure  that  the  chaff  cloud  is  initially  in  the  range  gate,  we  select  launch  parameters 
that cause it to bloom at the same distance from the ASCM as the ship. As in the previous case,
the range gate will then alternate between positions based on the ship and the chaff, usually in a
position based on the ship, until separation A in Fig. 5 is reached. SAM director illumination during
this  period  simply  serves  to  reduce  the  time  spent  tracking  the  chaff  because  during  illumination 
the  position  of  the  range  gate  will  be  based  on  the  SAM  director  signal. 

Between  separations  A  and  B,  SAM  director  illumination  still  results  in  capture  of  the  range 
gate  by  the  SAM  director.  When  the  illumination  terminates,  the  range  gate  shifts  to  the  ship,  and 
capture  of  the  range  gate  by  the  chaff  becomes  possible.  Between  separations  B  and  C,  capture  of 
the range gate by the chaff remains possible, but the SAM director is no longer able to recapture
it. In this region, if the range gate is tracking the ship, it moves slightly to track the SAM director.
If, however, it is tracking the chaff, it remains on the chaff. So SAM director illumination during
this period simply serves to lock the range gate in its present position.

Summarizing, if the chaff captures the range gate after the last SAM director illumination
between separations A and B, a successful seduction is assured. The probability of such a capture
by the chaff depends upon the pattern of illumination between the time of that last illumination
and the time separation C is reached. A complete description of an illumination pattern requires
specifying whether the SAM director is illuminating the ASCM at each time step in this period.
Fortunately, the minimum duration of a SAM flight restricts the set of feasible illumination patterns
to those that contain a small number of contiguous periods of illumination. The maximum number
of separate no illumination periods depends on the length of the period between separations B and
C, the minimum range of the SAM system, and the speed of the SAM.

Further simplification results from a time invariance that is a consequence of random fading.
First consider the class of illumination patterns with a single no illumination period of fixed
duration. Since we cannot readily control when the ASCM will observe a fade by the ship, we make an
a  priori  estimate  for  the  probability  that  a  fade  by  the  ship  will  occur  during  the  no  illumination 
period  that  is  sufficiently  deep  to  cause  a  range  gate  transition.  This  probability  estimate  depends 
upon  a  specific  set  of  conditions  that  includes  sea  state,  ship  heading,  and  ship  speed.  The  fades 
are  a  direct  result  of  ship  motion,  which  is  a  narrow  band  random  process.  For  simplicity,  we 
assume  here  that  for  any  time  chosen  at  random,  the  next  fade  is  equally  likely  to  occur  at  any 
time  within  an  interval  roughly  corresponding  to  a  fundamental  period  of  that  narrowband  process. 
Since  we  assume  the  phase  of  the  fading  patterns  to  be  uniformly  distributed,  we  hypothesize  that 
the probability will be the same regardless of when the no illumination period begins. We call the
duration  of  this  single  quiescent  period  D.  At  the  beginning  of  the  period  the  ASCM  tracks  the 
ship,  having  just  shifted  there  from  the  SAM  director.  As  D  increases,  the  cumulative  probability 
of  a  sufficiently  deep  fade  by  the  ship  increases  as  well. 
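
The uniform assumption stated above has a simple consequence that can be written out. In this hedged sketch, $W$ (the waiting time until the next fade) and $T_f$ (the length of the interval over which the next fade is equally likely) are symbols introduced here for illustration, not taken from the report.

```latex
W \sim \mathrm{Uniform}(0,\, T_f), \qquad
\Pr\{ W \le D \} = \min\!\left( \frac{D}{T_f},\; 1 \right), \quad D \ge 0 .
```

Under this assumption the cumulative probability of a fade grows linearly with $D$ until it saturates. The expression covers the occurrence of any fade; whether a given fade is deep enough to cause a range gate transition is a separate factor that this sketch does not model.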

Once  separation  C  is  reached,  the  outcome  of  the  seduction  attempt  is  completely  determined. 
We define the time at which separation C is reached as our reference time $t_R$ and define $D(t)$
to be the unilluminated time that has been accumulated by time $t$. The probability of seduction is
thus a function of the duration of the no illumination period at the reference time:

$$P_S : D(t_R) \mapsto \Pr\{\text{Successful seduction at time } t_R;\ D(t_R)\}. \qquad (1)$$

We  could  construct  a  similar  function  for  more  complex  illumination  patterns  as  well.  Doing  so 
would  improve  the  fidelity  of  our  model  by  identifying  the  effect  of  temporal  correlations  within  the 
possible  fading  patterns.  We  believe,  however,  that  we  can  bound  the  performance  of  the  chaff  with 
Eq.  (1).  Because  the  probability  is  computed  by  averaging  over  every  phase  of  every  possible  fading 
pattern, the $P_S$ function provides an upper bound on the seduction probability when it is applied
to the total no illumination time and a lower bound on the seduction probability when applied
to the duration of the longest no illumination period. Since either bound reflects the interaction
between chaff and SAM employment and the argument for the upper bound is easier to compute,
we use the total no illumination time as the domain of $P_S(\cdot)$ instead of developing a more complex
chaff effectiveness model.
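
The two bounds can be illustrated with a short Python sketch. It assumes the illumination pattern is given as one boolean per time step over the relevant window, and that `ps` is any caller-supplied seduction probability function of the no illumination time. The names `quiet_runs`, `seduction_bounds`, and `step` are hypothetical, and the linear `ps` in the usage lines is a placeholder rather than the measured function.

```python
def quiet_runs(illuminated):
    """Return the lengths (in steps) of consecutive no illumination runs."""
    runs, current = [], 0
    for lit in illuminated:
        if lit:
            if current:
                runs.append(current)
            current = 0
        else:
            current += 1
    if current:
        runs.append(current)
    return runs


def seduction_bounds(illuminated, ps, step=1.0):
    """Return (lower, upper) bounds on the seduction probability.

    lower: ps applied to the longest single no illumination period.
    upper: ps applied to the total no illumination time.
    """
    runs = quiet_runs(illuminated)
    longest = step * max(runs, default=0)
    total = step * sum(runs)
    return ps(longest), ps(total)


if __name__ == "__main__":
    # Hypothetical pattern and a placeholder linear ps that saturates at 10 steps.
    pattern = [False, False, True, False, False, False, True, False]
    print(seduction_bounds(pattern, ps=lambda d: min(d / 10.0, 1.0)))
```

Because `ps` is nondecreasing in this usage, the lower bound never exceeds the upper bound, and the two coincide when the pattern contains a single no illumination period.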

The  computation  of  D  is  then  quite  straightforward.  If  there  is  SAM  director  illumination 
between  separations  A  and  B,  we  begin  counting  D  from  zero  at  the  end  of  the  last  such  illumination. 
When  there  is  no  such  illumination,  we  can  begin  counting  D  from  zero  when  separation  A  is 
reached  without  significantly  changing  our  analysis,  because  the  ASCM  is  very  likely  to  be  tracking 
the  ship  at  that  time.  If  there  is  SAM  director  illumination  between  separations  B  and  C,  we  stop 
incrementing D when it begins and resume counting when it ends.
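
The counting rule for D can be sketched in a few lines of Python. This is a minimal sketch under assumed conventions: time is divided into equal steps, `illuminated[k]` records whether the SAM director is illuminating the ASCM during step `k`, and `step_a`, `step_b`, and `step_c` are hypothetical indices of the steps at which separations A, B, and C are reached. None of these names come from the report.

```python
def unilluminated_time(illuminated, step_a, step_b, step_c, dt=1.0):
    """Compute D at the time separation C is reached.

    Counting starts at the end of the last illumination between
    separations A and B, or at separation A if there is none.
    Illuminated steps after the start are skipped, so counting
    pauses during illumination and resumes when it ends.
    Assumes step_a <= step_b <= step_c <= len(illuminated).
    """
    start = step_a
    for k in range(step_a, step_b):
        if illuminated[k]:
            start = k + 1  # restart the count after this illuminated step
    quiet_steps = sum(1 for k in range(start, step_c) if not illuminated[k])
    return quiet_steps * dt


if __name__ == "__main__":
    F, T = False, True
    # Illumination during steps 2-3 (between A and B) and step 6 (between B and C).
    schedule = [F, F, T, T, F, F, T, F, F, F]
    print(unilluminated_time(schedule, step_a=1, step_b=5, step_c=10))
```

With continuous illumination the function returns zero, which matches the case in which no seduction is possible.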

2.5  Measurement  of  Chaff  Effectiveness

If $D(t_R) = 0$, the continuous illumination prevents a successful seduction ($P_S = 0$). To quantify
the remainder of the dependence of $P_S$ on $D$, we simulated a series of experiments by using the
CRUISE Missiles testbed. Figure 6 shows the geometry we used. This geometry results in sufficient
range and cross-range separation between the chaff and the ship to allow a seduction, and it exploits
the leading edge bias of the tracker by assigning the ship a velocity component away from the
ASCM.  No  wind  or  current  was  applied,  and  unaccelerated  ship  motion  was  assumed.  Because  the 
maximum  range  of  the  SAM  system  against  low  altitude  ASCMs  depends  on  the  elevation  of  the 
SAM  director,  the  SAM  director  is  normally  placed  high  on  the  superstructure  near  the  middle 
of  the  ship.  For  this  reason  we  placed  the  SAM  director  35  m  directly  above  the  ship’s  center  of 
gravity. 

Fig.  6  —  Monte  Carlo  simulation  initial  conditions

The  ordinate  of  the  graph  in  Fig.  7  shows  the  empirical  probability  of  seduction  that  resulted 
from  continuous  illumination  from  the  beginning  of  the  run  until  the  separation  plotted  on  the 
abscissa was reached and then no further illumination until the ASCM reached the inner zone. The
measure of separation that we have chosen is the range separation between the ship’s center of
gravity and the geometric center of the chaff along the ASCM-to-chaff line of sight. We calculated
this empirical probability by conducting a set of simulation runs and observing whether the ASCM
was tracking the ship or the chaff at the end of each run. In each trial we independently selected
a  pseudorandom  seed  for  the  ship  and  sea  motion  components  of  the  model.  This  results  in  trial 
outcomes  that  are  independent  and  identically  distributed.  Under  this  condition,  the  empirical 
probability  approaches  the  parameter  of  the  underlying  binomial  distribution  as  the  number  of 
runs  increases.  A  sufficient  number  of  runs  were  conducted  to  achieve  a  0.9  confidence  that  the 
true  seduction  probability  lies  within  a  ±0.1  confidence  interval  of  the  plotted  value.  Data  points 
are  plotted  at  approximately  25  m  intervals. 
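
The report does not state how many runs were needed to meet the 0.9 confidence, ±0.1 criterion. As a rough illustration only, the sketch below estimates a sufficient number of independent trials using the normal approximation to the binomial distribution with the worst-case seduction probability of 0.5. The function name `runs_needed` and the choice of approximation are assumptions, not the authors' procedure.

```python
import math
from statistics import NormalDist


def runs_needed(half_width: float, confidence: float, p: float = 0.5) -> int:
    """Smallest number of trials for which a two-sided normal-approximation
    interval around a binomial proportion p has the requested half width."""
    # Two-sided critical value, e.g. about 1.645 for 0.9 confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return math.ceil(z * z * p * (1.0 - p) / half_width ** 2)


print(runs_needed(0.1, 0.9))  # 68 under these assumptions
```

Because p = 0.5 maximizes the binomial variance, this count is conservative for data points whose true seduction probability is closer to 0 or 1.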

Examination of Fig. 7 reveals that continuous illumination past 375 m of range separation
always prevents a successful seduction. Therefore, 375 m in this geometry corresponds to separation
C in Fig. 5. The time without illumination (D) before separation C is reached is plotted on the
top of the graph. The data show a nearly linear increase in seduction probability with an increase
in D. The limiting probability of seduction (in this case 1.0) depends on the relation between the
signal from the chaff and the signal from the ship at the deepest point of a fade by the ship for
this geometry. We call this limiting value Pmax. The time without illumination that is required to
achieve this limiting seduction probability depends on the frequency with which fades occur, which
is determined by the ship motion and the sea motion in the simulation. We call the smallest value
of D that maximizes the probability of a successful seduction Dmax.
Fig. 7 — Typical seduction probability function


Separation B in Fig. 5 can be found by determining the position of the range gate when it is
tracking the chaff and then by computing the minimum range separation necessary to place all the
energy from the SAM director outside of the range gate. In our low-resolution ASCM seeker we
use a 2 μs range gate because we have a 1 μs pulse width and a matched filter. When the leading
edge tracker is tracking the chaff, the trailing edge of the range gate is located approximately 0.63
μs beyond the geometric center of the chaff cloud in signal space. The geometric center of the chaff
cloud in signal space corresponds to the center of the chaff’s flat spot in Fig. 5. When coupled with
the 1 μs matched filter, a 1.63 μs round-trip transit time difference between the geometric center
of the chaff and the location of the SAM director (the peak of the SAM director signal in Fig.
5) is sufficient to prevent the signal returned by the SAM director from influencing the tracker’s
behavior. For simplicity we have placed the SAM director directly over the ship’s center of gravity.
Thus, separation B is 245 m for this geometry. Note that we have implicitly assumed that the
bias in the range gate is sufficient to prevent the signals reflected by closer parts of the ship from
significantly affecting the leading edge tracker’s behavior in this case.
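
The 245 m figure follows from converting the 1.63 μs round-trip time difference into a one-way range. A minimal sketch of that arithmetic is given here; the function name `range_separation_m` is illustrative, and free-space propagation at the speed of light is assumed.

```python
C_LIGHT = 299_792_458.0  # speed of light in m/s (free-space propagation assumed)


def range_separation_m(round_trip_s: float) -> float:
    """One-way range separation corresponding to a round-trip time difference."""
    return C_LIGHT * round_trip_s / 2.0


gate_trailing_edge_s = 0.63e-6  # trailing edge of the range gate past the chaff center
matched_filter_s = 1.0e-6       # matched filter duration
print(range_separation_m(gate_trailing_edge_s + matched_filter_s))
```

The result is a little over 244 m, which agrees with the quoted 245 m to within rounding.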

Determination of separation A in Fig. 5 requires an additional experiment. We began by creating
a single no-illumination period of duration Dmax that ends when separation C is reached.
We then shifted the no-illumination period earlier in time while maintaining a duration of Dmax,
until the period barely lasted past separation B. If a reduction in seduction probability had been
noted, the smallest separation that avoids the reduction would be separation A. However, no such
reduction occurred. From this we conclude that separation A is less than 155 m (the smallest
separation attempted) for this geometry. In that case the precise value of separation A is insignificant,
because a lack of illumination between the actual value of separation A and separation B would
have the same effect as a lack of illumination between a separation of 155 m and separation B.
Therefore, we can consider separation A to be 155 m. Although this choice could result in a failure
to count D between the actual value of separation A and a separation of 155 m, the resulting
seduction probability would be unaffected because either the seduction probability would already
be maximized or an intervening illumination would reset D.

We found similar seduction probability functions for other initial geometries as well. Variations
in ship velocity produced the expected compression or expansion of the time axis. Slight shifts
in separations A and C were also observed. These shifts occur because the tracker is affected by the
range extent of the ship along the ASCM-to-ship line of sight, which varies with the ship’s heading.
Some ship headings also result in larger or smaller signal returns from the ship, which changes the
effectiveness of the chaff. When present, this effect results in compression or expansion along the
seduction probability axis. The value of Pmax is easily found by conducting a single set of runs
with no illumination. Changing the initial position of the chaff cloud slightly did not affect the
outcome of our experiments. In particular, initial range separations between 0 and 270 m resulted
in truncated versions of Fig. 7, and similar initial cross-range separations had no significant effect.
Varying the position of the ASCM in the middle zone had no significant effect, because the period
between multipath fades is nearly constant while the ASCM is in the middle zone.

2.6  SAM  System  Effectiveness  Model 

The  position  of  the  chaff  cloud  can  influence  the  performance  of  the  SAM  system.  If  the  chaff 
cloud  is  placed  directly  between  the  ship  and  the  ASCM  at  a  low  altitude  (near  the  end  of  its  life), 
it could prevent the SAM director illumination from reaching the ASCM. This would result in loss
of SAM guidance and require destruction of the SAM before it reaches the ASCM. This situation
can  be  avoided  by  choosing  a  chaff  bloom  position  that  places  it  astern  of  the  ship-to-ASCM  line 
of  sight  before  its  altitude  decays  enough  to  cause  a  problem.  Since  that  tactic  is  compatible 
with  the  requirements  for  optimal  chaff  effectiveness,  we  shall  adopt  it  and  consider  SAM  system 
effectiveness  in  isolation. 

Because  our  focus  is  on  the  SAM/chaff  interaction,  we  have  chosen  the  simplest  possible  model 
for SAM system effectiveness. We model the effect of a SAM intercept with a constant Pk that
represents  the  probability  that  a  SAM  intercepting  an  ASCM  in  the  middle  zone  will  destroy  it. 
The  parameter  Pk  can  be  chosen  based  on  simulation  results  or  operational  experience.  If  an 
ASCM  is  intercepted  but  not  destroyed,  it  can  later  be  intercepted  again  by  another  SAM.  By 
choosing  Pk  to  be  constant,  we  are  treating  each  intercept  as  an  independent  event. 
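
Treating intercepts as independent events with constant Pk means that the probability an ASCM survives several intercepts is a simple power. The short sketch below illustrates this; the function name `survival_probability` is hypothetical, and the default of 0.3 is the value the report assigns to Pk in Section 3.

```python
def survival_probability(intercepts: int, p_k: float = 0.3) -> float:
    """Probability that an ASCM is still alive after the given number of
    independent SAM intercepts, each destroying it with probability p_k."""
    return (1.0 - p_k) ** intercepts


for k in range(4):
    print(k, round(survival_probability(k), 3))
```

With Pk = 0.3, roughly half of the ASCMs would survive two intercepts, which is why seduction remains worth considering alongside SAM employment.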


3  ENGAGEMENT  MODEL 


Although  the  individual  effectiveness  models  are  adequate  to  determine  the  distribution  on  the 
outcome  of  a  single  event,  the  integration  of  these  into  a  unified  whole  remains  to  be  done.  We  are 
interested  in  the  effectiveness  of  several  applications  of  the  SAM  and  chaff  systems  in  defending 
against  an  attack  by  multiple  ASCMs.  We  restrict  our  analysis  to  the  case  in  which  all  the  ASCMs 
arrive  from  the  same  direction,  and  each  travels  with  approximately  the  same  velocity,  because  by 
doing  so  we  simplify  the  engagement  model  while  preserving  the  interaction  we  wish  to  study.  We 
chose  the  arrival  angle  shown  in  Fig.  6,  because  it  accommodates  the  requirements  of  our  tactics. 
Since  Fig.  7  indicates  that  the  seduction  probability  is  closely  approximated  by  a  negative-going 
ramp  function,  we  have  used  that  approximation  to  construct  an  idealized  Ps  function  defined  as 


$$ P_S = P_{max} \cdot \frac{\min\{D, D_{max}\}}{D_{max}} \qquad (2) $$


Inspection of Fig. 7 suggests that a reasonable value for Dmax is 8 s. The interaction we wish
to study is only significant when Pmax assumes a moderate value. If high assurance of a successful
seduction were possible, SAM employment would not be necessary. On the other hand, a low value for
Pmax keeps seduction from being worth considering. For this reason we have chosen to assign Pmax a
value of 0.5. Similar considerations dictate the choice of a moderate value for Pk. Therefore, we
have somewhat arbitrarily assigned Pk = 0.3.
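
The idealized seduction probability can be sketched as a clipped linear ramp in the no-illumination time D: zero when D = 0 and rising to Pmax once D reaches Dmax. The function below is an illustration built from that description and the parameter values chosen above; the name `seduction_probability` and the clamping of negative inputs to zero are assumptions.

```python
def seduction_probability(d: float, p_max: float = 0.5, d_max: float = 8.0) -> float:
    """Idealized ramp: 0 at d = 0, rising linearly to p_max at d = d_max,
    and constant at p_max for longer no-illumination times."""
    clipped = min(max(d, 0.0), d_max)  # negative d treated as 0 (assumption)
    return p_max * clipped / d_max


for d in (0, 2, 4, 8, 12):
    print(d, seduction_probability(d))
```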

In developing a model for this multiple ASCM problem, which we will call the “engagement
model,” we must examine the effect that a single application of each defensive system has on
different ASCMs. Because it is extremely unlikely that two ASCMs would be so close to each other
that both could be destroyed by the same SAM, we will ignore that case. Hence, each SAM can
affect at most one ASCM. A single chaff cloud could, however, affect every ASCM in the middle
zone. Each ASCM arriving from a given direction will observe the same range separation between
the chaff and the ship at the same time. Since we have restricted our attention to the case in which
all ASCMs arrive from the same direction, every ASCM in the middle zone will experience each
seduction attempt simultaneously.

The  fading  on  which  seduction  depends  is  caused  by  multipath  propagation  and  by  changes 
in the ship’s orientation. Multipath fading occurs when the lengths of two ray paths differ by a
half-wavelength  (modulo  the  wavelength).  The  dominant  multipath  interference  occurs  between 
the  direct  raypath  and  the  raypath  reflected  once  off  the  sea  surface.  ASCMs  with  the  same  path 
length  difference  (modulo  the  wavelength)  will  observe  synchronized  fading,  while  for  other  ASCMs 
the  observed  multipath  fades  will  occur  at  different  times.  Fading  due  to  changes  in  the  ship’s 
orientation  occurs  when  many  of  the  normally  dominant  scatterers  are  viewed  from  an  orientation 
in  which  their  reflectivity  is  low.  All  ASCMs  arriving  from  the  same  direction  should  observe 
synchronized  orientation-based  fading.  Although  it  should  be  possible  to  construct  an  accurate 
model  of  the  relationship  between  the  fading  observed  by  multiple  ASCMs,  we  have  chosen  to  treat 
simultaneous  seduction  attempts  as  mutually  independent  events.  This  choice  allows  us  to  specify 
state  variable  transition  probabilities  individually  rather  than  in  all  possible  combinations,  thereby 
reducing the complexity of the model.

The high speed of each ASCM and the limited number of ASCMs that an adversary could
reasonably use naturally lead to a finite time horizon formulation. To facilitate computational
solution,  we  have  chosen  a  discrete  time  specification  for  our  engagement  model.  To  simplify 
the  formal  specification  of  the  model  we  will  assume  that  all  moving  objects  travel  at  a  constant 
velocity.  This  assumption  allows  us  to  express  distances  as  time  periods,  eliminating  unnecessary 
unit  conversions. 


3.1  States  and  Controls 


Table  1  shows  the  state  variables  for  the  engagement  model.  We  use  the  index  i  to  distinguish 
between  similar  state  variables  that  refer  to  different  ASCMs  and  allow  i  to  range  from  1  to  m, 
where  m  represents  the  maximum  number  of  ASCMs  that  may  arrive.  Each  state  variable  is 
discussed in detail below. We refer to the entire collection of state variables at time t as the “state”
at time t, Xt. Note that while we use a subscript i on a state variable to indicate the associated
ASCM, we use the subscript t on the state (and later on the control function) to represent time.
We have included enough information in the state to construct a controlled one-step Markov model,
which is required for one of the optimization techniques that we will consider.


Table  1:  State  variables 


We sample state transitions once every Δ seconds, with the first transition beginning at time
0 and the last at time τ − Δ. Because the next state may depend on the outcome of one or
more random events (ASCM detection, SAM intercept, or seduction attempt), the transition from
the present state to the next state will, in general, be stochastic. Figure 8 specifies the allowed
transitions for each state variable. Each arc is labeled with the condition under which that arc
may be taken. When the transition is not deterministic, the probability that the transition occurs is
separated from the condition by a comma.


Fig. 8 — State variable transition diagrams


We  will  restrict  our  attention  to  ASCM  arrival  distributions  that  depend  only  on  time  and 
the  number  of  ASCMs  that  have  already  been  detected.  We  will  use  the  information  about  the 
number  of  previously  detected  ASCMs  to  limit  the  attacker  to  m  ASCMs.  This  is  a  relatively 
simple  formulation  that  reflects  both  the  attacker’s  lack  of  detailed  knowledge  of  the  defender’s 
state  and  the  defender’s  uncertainty  about  the  attacker’s  strategy  while  producing  a  controlled 
one-step Markov model. Therefore, we take as given:

$$ P_A(n, t, a) = \Pr\{n \text{ ASCMs will arrive at time } t \text{ given that } a \text{ ASCMs have already arrived}\}. \qquad (3) $$

Because  a  relatively  large  range  separation  is  required  between  chaff  clouds  if  separate  seduction 
attempts are to occur, relatively few seduction opportunities can be created for each ASCM. We
have  chosen  to  repeatedly  fire  chaff  rounds  at  the  minimum  effective  interval  to  maximize  the 
chance  of  a  successful  seduction.  This  decision  reduces  the  complexity  of  the  control  policy  we  seek 
without  eliminating  our  ability  to  observe  the  effects  of  the  SAM/chaff  interaction. 

Accordingly, the ship has available one control, L(t). Setting L(t) to True represents the launch
of a SAM at time t. Setting L(t) to False represents forgoing the opportunity for a launch at
that time. L(t) is constrained to be False if a SAM is already in flight or if the SAM inventory
is exhausted. We make it possible to enforce an inventory constraint by defining N to be the
number of SAMs remaining. Initially we set N to N0, the number of SAMs that are available at
the beginning of the engagement.

The periodic nature of chaff deployment is represented by a counter C that represents the time
until the next chaff cloud will be properly positioned for a seduction attempt. C initially has
a value C0, which represents the phase of the periodic chaff replacement. Subsequently, C counts
down modulo Cmax + Δ.
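
One reading of this counter, offered as an illustration rather than as the report's definition, is that C decreases by Δ each step and wraps from 0 back to Cmax, so that seduction attempts (C = 0) recur every Cmax + Δ seconds after the first one at time C0. The function names below are hypothetical, and the parameter defaults are the typical values listed in Table 2.

```python
def next_chaff_counter(c: int, c_max: int = 19, delta: int = 1) -> int:
    """Advance the chaff counter one time quantum, wrapping from 0 to c_max
    (assumed interpretation of counting down modulo c_max + delta)."""
    return c_max if c == 0 else c - delta


def seduction_times(c0: int = 30, c_max: int = 19, delta: int = 1,
                    final_time: int = 147) -> list:
    """Times in [0, final_time) at which the counter equals zero."""
    times, c = [], c0
    for t in range(0, final_time, delta):
        if c == 0:
            times.append(t)
        c = next_chaff_counter(c, c_max, delta)
    return times


print(seduction_times())  # [30, 50, 70, 90, 110, 130]
```

Under this reading, the typical parameter values give six seduction opportunities during the 147 s engagement.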

We use Ai to represent the time remaining before ASCM i reaches the inner zone. Ai is initially
set to ∞, which represents a remaining time greater than τM. As long as ASCM i is outside the
middle zone there is a chance that it will arrive at the outer boundary of that zone and be detected.
There will be n arrivals at time t with probability PA(n, t). By convention we assume that ASCMs
are numbered in the order of their arrival. Therefore, ASCM i arrives at time t if there are more
arrivals at time t than there are lower numbered ASCMs that have not yet arrived. When ASCM
i is detected, Ai is set to τM. Ai subsequently decrements until ASCM i reaches the inner zone or
is destroyed by a SAM intercept. For ASCMs that reach the inner zone, we allow the value of
Ai to continue to decrement below zero and interpret negative values of Ai to represent ASCMs
that are no longer in the middle zone. If ASCM i is the closest ASCM in the middle zone (i.e., i is
the smallest j for which 0 < Aj < τM) and it is destroyed by a SAM intercept, we set Ai to −∞,
representing an ASCM that will never reach the inner zone (and thus never hit the ship).

Intercepts occur when the most recently launched SAM reaches the closest ASCM in the middle
zone. We assume the SAM and ASCM both move with constant velocity and define VS to be the
ratio of the velocity of the SAM to the velocity of the ASCM. The distance between the SAM
and the ASCM will decrement at VS + 1 times the rate at which Ai is decrementing. At time t,
min{Ai(t) : 0 < Ai(t) < τM} + τI more seconds would be required for the closest ASCM in the
middle zone to reach the ship. If we divide these two quantities and adjust the result to be an
integer multiple of Δ, then at the time the SAM is launched we can calculate the time tI at which
the intercept will occur as

$$ t_I(t) = t + \frac{\min_i\{A_i(t)\} + \tau_I}{V_S + 1} \qquad (4) $$


The state variable S is used to carry this information forward from the SAM launch time to the
time of the intercept. Initially S is set to −∞, where we interpret negative values of S to represent
a state with no airborne SAM. When a SAM is launched, S is reset to the calculated intercept time
tI. Subsequently, S decrements to zero, at which time an intercept is recognized. Since at most
one SAM can be in flight at a time, a scalar value suffices to represent this information.
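
The intercept timing described above can be sketched as follows. The closing speed is VS + 1 times the ASCM speed, so the delay from launch to intercept is the closest ASCM's remaining time to the ship divided by VS + 1. The function name `intercept_delay` is hypothetical, the defaults are the typical values from Table 2, and rounding up to a multiple of Δ is an assumption because the text does not state the rounding direction.

```python
import math


def intercept_delay(time_to_inner, tau_i=12, v_s=3.0, tau_m=48, delta=1):
    """Delay from SAM launch to intercept of the closest ASCM in the middle
    zone, or None when no ASCM is in the middle zone.

    time_to_inner holds one A_i per ASCM: math.inf for an undetected ASCM,
    -math.inf for a destroyed one."""
    in_zone = [a for a in time_to_inner if 0 < a < tau_m]
    if not in_zone:
        return None
    raw = (min(in_zone) + tau_i) / (v_s + 1.0)
    return math.ceil(raw / delta) * delta  # rounding up is an assumption


print(intercept_delay([30, math.inf, -math.inf]))  # 11
```

The intercept time in Eq. (4) is then the launch time t plus this delay.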

The no-illumination duration D remains zero until separation A is reached, which occurs when
C = CA. Subsequently, D is incremented by Δ whenever there is no airborne SAM and reset to
zero when there is an airborne SAM. Once C reaches separation B, which occurs when C = CB, D
continues to increment when there is no airborne SAM but holds its value when a SAM is airborne.
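
The update rule for D can be summarized in a few lines. The sketch below follows the prose description; the function name `update_no_illumination` is hypothetical, the defaults are the typical values from Table 2, and the treatment of the exact boundary values C = CA and C = CB is an assumption.

```python
def update_no_illumination(d, c, sam_airborne, c_a=15, c_b=8, delta=1):
    """Next value of the no-illumination counter D given the chaff counter c
    and whether a SAM is airborne (so the director is illuminating)."""
    if c > c_a:
        return 0          # separation A not yet reached: D stays at zero
    if not sam_airborne:
        return d + delta  # no illumination: keep counting
    if c > c_b:
        return 0          # illumination between separations A and B resets D
    return d              # illumination after separation B only pauses D


print(update_no_illumination(3, 12, False))  # 4
print(update_no_illumination(3, 12, True))   # 0
print(update_no_illumination(3, 8, True))    # 3
```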

The Boolean state variable Ti indicates whether ASCM i is tracking the ship. Initially Ti is True
for all i, because all newly detected ASCMs are assumed to be tracking the ship. Seduction attempts
occur when C = 0, and they can be effective against any ASCM in the middle zone (i.e., any ASCM i for
which 0 < Ai < τM). At each seduction attempt Ti will become False if the seduction attempt is
successful.

We  have  collected  all  of  the  parameters  of  the  engagement  model  in  Table  2  and  indicated 
typical  values  for  a  simple  instance  of  the  model.  While  these  parameters  are  often  somewhat 
arbitrary, the entire set of parameters must be chosen in a consistent manner. For instance, τM
must be some multiple of Δ so that it will be possible for Ai to reach zero.
Table 2: Engagement Model Parameters

Parameter  Value  Units       Description
Δ          1      seconds     Time quantum
τ          147    seconds     Final time
τM         48     seconds     ASCM flight time in the middle zone
τI         12     seconds     ASCM flight time in the inner zone
C0         30     seconds     Time of first seduction attempt
CA         15     seconds     Value of C for separation A
CB         8      seconds     Value of C for separation B
Cmax       19     seconds     Maximum time until the next seduction attempt
Dmax       8      seconds     Maximum no-illumination duration
m          3      ASCMs       Maximum number of ASCMs which may arrive
N0         5      SAMs        Initial SAM inventory
Pk         0.3                Probability that a SAM will kill an ASCM
Pmax       0.5                Maximum seduction probability
VS         3      ASCM speed  Relative speed of a SAM

3.2  Observations 

The value of most of the state variables can be observed or deduced without error at each time.
Figure 8 identifies the initial value for each state variable using a small arrow. Since C has a known
initial value and it evolves deterministically, C(t) is known a priori for all t, regardless of the policy
selected. The evolution of N is completely determined by the control L(t), which becomes known
as it is chosen at each time. So N(t) becomes known by time t.

Although Ai evolves stochastically, its value can be observed at each time. Because the velocity
of an ASCM is nearly constant, Ai is directly proportional to the distance between the ASCM and
the ship. Thus, Ai(t) can be observed at time t without significant error by using surveillance
radar equipment. Since {Ai(t)}, i = 1, ..., m, is known in this way at time t and L(t) is also known
at time t, S(t) can be calculated by using Eq. (4) at time t. And finally, once S(t) and C(t) are
known, D(t) can be calculated.

As long as no seduction has been attempted against ASCM i, Ti evolves deterministically and
its value can be computed. Once a stochastic state transition occurs, however, this is no longer
possible. Furthermore, it is impractical to observe Ti while ASCM i is in the middle zone. Ti(t)
again becomes known with certainty only after ASCM i enters the inner zone and reaches either
the ship or the chaff cloud.

We will refer to the known state variables at time t as the observation at time t, Ot, and define
a function O : Xt → Ot. Ot consists of every state variable in Xt except Ti(t) for those ASCMs
that were in the middle zone during a prior seduction attempt and have not yet reached either the
ship or the chaff.

3.3  Reward  Function 


We  assume  that  a  single  hit  by  an  ASCM  is  sufficient  to  disable  the  ship  for  the  remainder  of 
the  engagement.  For  small  ships  such  as  frigates  and  destroyers  this  is  a  reasonable  assumption. 
Therefore,  we  wish  to  find  a  policy  that  will  maximize  the  probability  that  no  ASCMs  will  hit  the 
ship between time 0 and time τ.
By a policy we mean a set of functions indexed by time, the range of each being the control to
be applied at that time. In general, we would like to base our control decisions on all of the useful
information available at each time. If we were to choose Xt as the domain for the control function
at time t, we would obtain a one-step Markov model. Furthermore, since Xt captures all of the
relevant information about prior states and controls that is necessary to determine the next state,
adding prior state and control information to the domain of the control function would not change
the optimal value of the reward function. Unfortunately, it is sometimes impossible to observe part
of Xt. When part of Xt cannot be observed, a control function that requires all of Xt as its domain
would not be useful to a defender.

Another obvious choice for the domain of the control function at time t is Ot, the observation
at time t. Since Ot ⊂ Xt, this choice would yield a one-step Markov model using a control function
that could be used directly by the defender. In this case, however, adding prior observations to the
domain of the control function could potentially improve the optimal value of the reward function.
In addition, our knowledge of system dynamics also makes it useful to include the prior controls
we have applied in the domain of the control function. To quantify this dependence we define the
information vector at time t to be

$$ I_0 = O_0 $$

$$ I_t = \{O_0, \ldots, O_t, L(0), \ldots, L(t - \Delta)\}, \qquad (t \in \{\Delta, \ldots, \tau - \Delta\}). $$

The policy we seek is a set of functions $\pi = \{\mu_t : I_t \mapsto L(t)\}$. We wish to find the policy π
that maximizes the probability that the ship is not hit when that policy is employed. Therefore,
we choose as our reward function

$$ J_\pi \equiv \Pr\{\text{Ship not hit}; \pi\}. $$

In this notation we have explicitly identified both the event ({Ship not hit}) and the parameter
(π) that determines the probability of that event. We shall use this notation extensively to call
attention to the functional dependence of a distribution on certain parameters.

4  OPTIMAL  SCHEDULING  TECHNIQUES 

We seek to find an optimal policy π* for which

$$ J_{\pi^*} = \max_{\pi}\{J_\pi\}. $$

We  shall  present  two  techniques  for  finding  an  optimal  policy,  exhaustive  search  and  dynamic 
programming. 


4.1  Exhaustive  Search 

Exhaustive search is a brute-force strategy that can be efficient when the number of alternatives
that must be considered is small. In the exhaustive search strategy we first compute the value of
Jπ for every possible policy π and then choose the policy that maximizes that value. We begin by
developing an algorithm for computing Jπ. Starting with the definition of the reward function we
apply the law of total probability:

$$
\begin{aligned}
J_\pi &= \Pr\{\text{Ship not hit}; \pi\} \\
      &= 1 - \Pr\{\text{Ship hit}; \pi\} \\
      &= 1 - \sum_{\{X_0, \ldots, X_\tau\}} \Pr\{\text{Ship hit} \mid X_0, \ldots, X_\tau; \pi\} \cdot \Pr\{X_0, \ldots, X_\tau; \pi\} \qquad (5)
\end{aligned}
$$
The engagement model described in Fig. 8 specifies $\Pr\{X_{t+\Delta} \mid X_t, L(t)\}$ and specifies a delta
function for $\Pr\{X_0\}$. Since $L(t) = \mu_t(I_t)$, we can use the one-step Markov structure of the model
to compute the second probability in Eq. (5):

$$ \Pr\{X_0, \ldots, X_\tau; \pi\} = \Pr\{X_0\} \prod_{t=0}^{\tau-\Delta} \Pr\{X_{t+\Delta} \mid X_t, \mu_t(I_t)\}. \qquad (6) $$

Computation of the first probability in Eq. (5) is somewhat more complex. In our analysis
above we determined that the ship will be hit if and only if at least one ASCM reaches the inner
zone without being seduced. That occurs if and only if there exist an ASCM i and a time t for which
both Ai(t) = 0 (i.e., at time t ASCM i reaches the inner zone) and Ti(t) = True (i.e., at time t
ASCM i is tracking the ship). This formulation captures the effect of a successful seduction directly
by using Ti but relies on Ai to keep ASCMs that are destroyed by SAMs from being considered.
Using the indicator function I, which has value 1 when the event occurs and 0 when it does
not,

$$ \Pr\{\text{Ship hit} \mid X_0, \ldots, X_\tau; \pi\} = I_{\{\exists (i,t) : (A_i(t) = 0) \wedge (T_i(t) = \text{True})\}}. \qquad (7) $$

Combining Eqs. (5), (6), and (7) and then eliminating zero terms from the summation we get

$$
\begin{aligned}
J_\pi &= 1 - \sum_{\{X_0, \ldots, X_\tau\}} I_{\{\exists (i,t) : (A_i(t) = 0) \wedge (T_i(t) = \text{True})\}} \cdot \Pr\{X_0\} \cdot \prod_{t=0}^{\tau-\Delta} \Pr\{X_{t+\Delta} \mid X_t, \mu_t(I_t)\} \\
      &= 1 - \sum_{\{X_0, \ldots, X_\tau\} \,:\, \exists (i,t) : (A_i(t) = 0) \wedge (T_i(t) = \text{True})} \Pr\{X_0\} \cdot \prod_{t=0}^{\tau-\Delta} \Pr\{X_{t+\Delta} \mid X_t, \mu_t(I_t)\} \qquad (8)
\end{aligned}
$$

Since we can repeatedly apply the definitions of $\mu_t$, $I_t$, and O to find $\mu_t(I_t)$, given $\{X_0, \ldots, X_t\}$,
this equation provides a somewhat cumbersome way to compute Jπ for any policy π. By trying
every policy, we will eventually find the one that maximizes Jπ. Examination of Eq. (8) reveals that
this computation will be most efficient when the time horizon is short and the number of possible
state sequences is small. Since the computation is repeated for every policy π, practical application
of this technique is only possible when there are few times when more than one alternative is
available.

4.2  Incremental  Computation  of  the  Reward  Function 

The  exhaustive  search  technique  requires  that  every  possible  path  through  the  state  space  be 
considered. Multiple control alternatives or stochastic state transitions will result in exponential
growth in the number of paths. If we define computational complexity to be the greater of
the  asymptotic  time  or  space  requirements  of  an  algorithm,  the  computational  complexity  of  the 
exhaustive  search  technique  grows  exponentially  as  the  number  of  time  steps  is  increased. 

The controlled one-step Markov structure of the engagement model suggests that it may be
useful to view this optimization problem as a multistage decision process and consider incremental
computation of Jπ. Incremental computation can result in greater efficiency by computing
intermediate values for each unique subpath only once. Two approaches to incremental computation
are possible.

Incremental computation forward in time is done by walking a decision tree to find the optimal
policy and considering several branches at each step in which a stochastic state transition occurs.
Only reachable states are considered when walking a decision tree. As the number of reachable
states becomes large, however, a tree walk will consider many of these states more than once. This
requires either repeating the computation each time a state is considered or storing the previous
result. If recomputation is chosen, the computational complexity of a tree walk grows exponentially
as the number of time steps is increased. When intermediate results are stored, a large intermediate
storage area that grows as the number of time steps is increased will be required; however,
computational complexity will grow linearly with the number of time steps.

Incremental  computation  backward  in  time  is  done  by  constructing  a  state  lattice  backward 
from  each  possible  final  state.  The  algorithm  for  incremental  computation  backward  in  time  is 
known  as  dynamic  programming.  Because  every  state  is  considered  only  once  at  each  time,  the 
computational  complexity  of  a  dynamic  programming  solution  grows  linearly  as  the  number  of  time 
steps  is  increased.  Although  dynamic  programming  requires  intermediate  storage  for  one  value  for 
each  state,  the  required  storage  area  does  not  change  as  the  number  of  time  steps  is  increased.  For 
that  reason,  we  will  next  apply  the  dynamic  programming  algorithm  to  this  problem. 
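The difference in growth rates can be illustrated with a small sketch. The toy model below is hypothetical (four states, two controls, two equally likely successors per control) and is not the engagement model; it only counts how many state evaluations a recomputing tree walk performs compared with a backward pass over a state lattice.

```python
N_STATES = 4          # hypothetical toy state space {0, 1, 2, 3}
CONTROLS = (0, 1)     # hypothetical binary control


def successors(state, control):
    """Toy dynamics (an assumption): two equally likely next states per control."""
    return [((state + control) % N_STATES, 0.5),
            ((state + control + 1) % N_STATES, 0.5)]


def tree_walk_visits(steps):
    """Count node visits of a forward tree walk that recomputes every subtree."""
    visits = 0

    def visit(state, t):
        nonlocal visits
        visits += 1
        if t == steps:
            return
        for control in CONTROLS:
            for next_state, _prob in successors(state, control):
                visit(next_state, t + 1)

    visit(0, 0)
    return visits


def lattice_evaluations(steps):
    """A backward pass evaluates each state exactly once at each time."""
    return N_STATES * (steps + 1)
```

With five time steps the tree walk already visits far more nodes than the lattice has entries, and the gap widens by a factor of four with each added step in this toy model.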

4.3  Dynamic  Programming 

In dynamic programming we seek to reduce the problem of selecting the optimal policy to a
sequential selection of optimal control functions for each time. The dynamic programming algorithm
works by associating with every state at time $t$ a value that represents the expected reward that
would be earned if we started in that state at time $t$ and employed an optimal policy between time
$t$ and time $\tau$. This set of optimum expected rewards can then be used to compute the optimum
expected reward associated with any state at time $t - \Delta$, if the incremental effect of each possible
control that could be applied from that state at time $t - \Delta$ is known.

Conventional dynamic programming is valid for a reward function formed as the expectation of
a sum of partial rewards. Appendix A describes a similar algorithm for a reward function that is
the expectation of a product of partial rewards. Appendix A identifies five properties that a model
must possess before that dynamic programming algorithm can be applied:

(a)  A finite set of states.

(b)  A control function with the state as its domain.

(c)  A controlled one-step Markov model in which the distribution on the next state is determined
by the present state and the present control.

(d)  A set of nonnegative partial reward functions, each of which has the state at one time as its
domain.

(e)  A reward function that is the expectation over the state sequence of a product over time of
those partial rewards.
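Under these five properties the backward recursion takes a simple form. The following is a sketch in generic notation, not a restatement of Appendix A: the state $x$, control $u$, partial reward $g_t$, and value $V_t$ are symbols assumed here for illustration.

$$
V_\tau(x) = g_\tau(x), \qquad
V_t(x) = \max_{u} \sum_{x'} \Pr\{x' \mid x;\, u\}\; g_t(x)\, V_{t+\Delta}(x')
       = g_t(x)\, \max_{u} \sum_{x'} \Pr\{x' \mid x;\, u\}\, V_{t+\Delta}(x').
$$

The factor $g_t(x)$ does not depend on the control, and it can be moved outside the maximization without changing the maximizing control only when it is nonnegative. This is one place where the nonnegativity required by property (d) appears to matter.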


4.3.1  Computation of the Reward Function Using the Information Vector

Because we do not have perfect knowledge of the state at each time, we must use the information
vector $I_t$ as the domain of the control function at time $t$. To satisfy property (b) we must treat $I_t$ as
if it were a state in our development of a dynamic programming solution. That choice satisfies
property (a) because $I_t$ is a finite union of finite sets.

Property (c) then requires that $I_{t+\Delta}$ be a known stochastic function (called the next state
function) of $I_t$ and $\mu_t(I_t)$. We will show the existence of a next state function for $I_t$ by briefly
sketching its derivation. We begin by applying the definition of $I_t$ to the next state function
from property (c) in Appendix A and then simplify the resulting expression, thus observing that

$$
\begin{aligned}
\Pr\{I_{t+\Delta} \mid I_t;\ \mu_t(I_t)\}
&= \Pr\{O_0,\ldots,O_t,O_{t+\Delta},\ L(0),\ldots,L(t-\Delta),L(t) \mid O_0,\ldots,O_t,\ L(0),\ldots,L(t-\Delta);\ \mu_t(I_t)\} \\
&= \Pr\{O_{t+\Delta} \mid O_0,\ldots,O_t,\ L(0),\ldots,L(t-\Delta);\ L(t)\}.
\end{aligned}
\tag{9}
$$


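Now recall that each $X_{t+\Delta}$ is associated with exactly one $O_{t+\Delta}$. So $\mathcal{O}$ partitions the state
space. We can therefore express the value in Eq. (9) as a sum of probabilities, then apply the law of
total probability and simplify the result by using the one-step Markov property of the engagement
model:

$$
\begin{aligned}
\Pr\{I_{t+\Delta} \mid I_t;\ \mu_t(I_t)\}
&= \sum_{X_{t+\Delta} \ni O_{t+\Delta} = \mathcal{O}(X_{t+\Delta})}
   \Pr\{X_{t+\Delta} \mid O_0,\ldots,O_t,\ L(0),\ldots,L(t-\Delta);\ L(t)\} \\
&= \sum_{X_{t+\Delta} \ni O_{t+\Delta} = \mathcal{O}(X_{t+\Delta})}\ 
   \sum_{\{X_0,\ldots,X_t\} \ni \{O_0,\ldots,O_t\} = \mathcal{O}(\{X_0,\ldots,X_t\})}
   \Pr\{X_{t+\Delta} \mid X_0,\ldots,X_t,\ O_0,\ldots,O_t,\ L(0),\ldots,L(t-\Delta);\ L(t)\} \\
&\qquad\qquad \cdot \Pr\{X_0,\ldots,X_t \mid O_0,\ldots,O_t,\ L(0),\ldots,L(t-\Delta);\ L(t)\} \\
&= \sum_{X_{t+\Delta} \ni O_{t+\Delta} = \mathcal{O}(X_{t+\Delta})}\ 
   \sum_{\{X_0,\ldots,X_t\} \ni \{O_0,\ldots,O_t\} = \mathcal{O}(\{X_0,\ldots,X_t\})}
   \Pr\{X_{t+\Delta} \mid X_t;\ L(t)\} \cdot \Pr\{X_0,\ldots,X_t \mid O_0,\ldots,O_t,\ L(0),\ldots,L(t-\Delta);\ L(t)\} \\
&= \sum_{\{X_0,\ldots,X_{t+\Delta}\} \ni \{O_0,\ldots,O_{t+\Delta}\} = \mathcal{O}(\{X_0,\ldots,X_{t+\Delta}\})}
   \Pr\{X_{t+\Delta} \mid X_t;\ L(t)\} \cdot \Pr\{X_0,\ldots,X_t \mid O_0,\ldots,O_t,\ L(0),\ldots,L(t-\Delta);\ L(t)\}.
\end{aligned}
\tag{10}
$$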


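The first probability in Eq. (10) is specified in Fig. 8 and the second can be found by first
applying Bayes' theorem, then simplifying the result through the observation that in Eq. (10)
$\{O_0,\ldots,O_t\} = \mathcal{O}(\{X_0,\ldots,X_t\})$, and finally by using the Markov property to write the joint
probability as a product of conditional probabilities:

$$
\begin{aligned}
\Pr\{X_0,\ldots,X_t \mid O_0,\ldots,O_t;\ \pi\}
&= \frac{\Pr\{X_0,\ldots,X_t;\ \pi\} \cdot \Pr\{O_0,\ldots,O_t \mid X_0,\ldots,X_t;\ \pi\}}
        {\sum_{\{X_0,\ldots,X_t\} \ni \{O_0,\ldots,O_t\} = \mathcal{O}(\{X_0,\ldots,X_t\})}
         \Pr\{X_0,\ldots,X_t;\ \pi\} \cdot \Pr\{O_0,\ldots,O_t \mid X_0,\ldots,X_t;\ \pi\}} \\
&= \frac{\Pr\{X_0,\ldots,X_t;\ \pi\}}
        {\sum_{\{X_0,\ldots,X_t\} \ni \{O_0,\ldots,O_t\} = \mathcal{O}(\{X_0,\ldots,X_t\})}
         \Pr\{X_0,\ldots,X_t;\ \pi\}} \\
&= \frac{\Pr\{X_0\} \prod_{t'=0}^{t-\Delta} \Pr\{X_{t'+\Delta} \mid X_{t'};\ L(t')\}}
        {\sum_{\{X_0,\ldots,X_t\} \ni \{O_0,\ldots,O_t\} = \mathcal{O}(\{X_0,\ldots,X_t\})}
         \Pr\{X_0\} \cdot \prod_{t'=0}^{t-\Delta} \Pr\{X_{t'+\Delta} \mid X_{t'};\ L(t')\}}.
\end{aligned}
$$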

So the next state function is well defined and property (c) is satisfied. Properties (d) and (e)
require that the reward function be formed as the expectation over the state sequence of a product
over time of nonnegative functions of the state. To put our reward function in this form, we start
with the definition of the reward function and again apply the law of total probability followed by
the definition of expectation to obtain:

$$
\begin{aligned}
J_\pi &= \Pr\{\text{Ship not hit};\ \pi\} \\
      &= 1 - \Pr\{\text{Ship hit};\ \pi\} \\
      &= 1 - \sum_{\{O_0,\ldots,O_\tau\}} \Pr\{O_0,\ldots,O_\tau;\ \pi\} \cdot \Pr\{\text{Ship hit} \mid O_0,\ldots,O_\tau;\ \pi\} \\
      &= 1 - \mathop{E}_{\{O_0,\ldots,O_\tau\}} \bigl[\Pr\{\text{Ship hit} \mid O_0,\ldots,O_\tau;\ \pi\};\ \pi\bigr].
\end{aligned}
\tag{11}
$$

In our notation for expectation, we place the parameter on which the distribution depends
inside the brackets after a semicolon. Now we can derive an analytic expression for the probability
that the ship is hit by recalling the condition we introduced earlier:

$$
\begin{aligned}
\Pr\{\text{Ship hit} \mid O_0,\ldots,O_\tau;\ \pi\}
&= \Pr\{\exists (i,t)\ (A_i(t) = 0) \wedge (T_i(t) = \text{True}) \mid O_0,\ldots,O_\tau;\ \pi\} \\
&= 1 - \Pr\{\forall ((i,t) \ni A_i(t) = 0)\ (T_i(t) = \text{False}) \mid O_0,\ldots,O_\tau;\ \pi\} \\
&= 1 - \prod_{(i,t) \ni A_i(t) = 0} \Pr\{T_i(t) = \text{False} \mid O_0,\ldots,O_\tau;\ \pi\}.
\end{aligned}
$$

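The last step is based on the mutual independence of $T_i(t)\ \forall i$ at every time $t$, given the same
control sequence. This independence is a consequence of the known value of $T_i(0)$ and the (assumed)
mutual independence of simultaneous seduction attempts. Substituting our result into Eq. (11) we
get

$$
\begin{aligned}
J_\pi &= 1 - \mathop{E}_{\{O_0,\ldots,O_\tau\}} \Bigl[\,1 - \prod_{(i,t) \ni A_i(t) = 0} \Pr\{T_i(t) = \text{False} \mid O_0,\ldots,O_\tau;\ \pi\};\ \pi\Bigr] \\
J_\pi &= \mathop{E}_{\{O_0,\ldots,O_\tau\}} \Bigl[\,\prod_{(i,t) \ni A_i(t) = 0} \Pr\{T_i(t) = \text{False} \mid O_0,\ldots,O_\tau;\ \pi\};\ \pi\Bigr].
\end{aligned}
\tag{12}
$$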

Since $T_i(t)$ is not known, we must compute its distribution based on the available information.

$$
\begin{aligned}
\Pr\{T_i(t) = \text{False} \mid O_0,\ldots,O_\tau;\ \pi\}
&= 1 - \Pr\{T_i(t) = \text{True} \mid O_0,\ldots,O_\tau;\ \pi\} \\
&= 1 - \Pr\{\text{No successful seduction attempt against ASCM } i \text{ by time } t \mid O_0,\ldots,O_\tau;\ \pi\}.
\end{aligned}
$$


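The relevant seduction attempts for each ASCM are those that occur while it is in the middle
zone. From the state transition diagram for $T_i$ in Fig. 8, we see that seduction attempts only
occur when $0 < A_i \le \tau_M$ and $C = 0$. Using this observation, the Markov structure of $T_i$, and the
definition of $P_S$ we proceed as follows:

$$
\begin{aligned}
\Pr\{T_i(t) = \text{False} \mid O_0,\ldots,O_\tau;\ \pi\}
&= 1 - \Pr\{\forall (t' \le t)\ \bigl((C(t') = 0) \wedge (0 < A_i(t') \le \tau_M)\bigr) \\
&\qquad\qquad \Rightarrow (\text{Failed to seduce ASCM } i \text{ at } t') \mid O_0,\ldots,O_\tau;\ \pi\} \\
&= 1 - \prod_{(t' \le t) \ni (C(t') = 0) \wedge (0 < A_i(t') \le \tau_M)} \Pr\{(\text{Failed to seduce ASCM } i \text{ at } t') \mid O_0,\ldots,O_\tau;\ \pi\} \\
&= 1 - \prod_{(t' \le t) \ni (C(t') = 0) \wedge (0 < A_i(t') \le \tau_M)} \bigl(1 - \Pr\{(\text{Seduced ASCM } i \text{ at } t') \mid O_0,\ldots,O_\tau;\ \pi\}\bigr) \\
&= 1 - \prod_{(t' \le t) \ni (C(t') = 0) \wedge (0 < A_i(t') \le \tau_M)} \bigl(1 - P_S(D(t'))\bigr).
\end{aligned}
\tag{13}
$$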


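Combining Eqs. (12) and (13) and then separating the outer product into products over $t$ and
$i$ we get:

$$
\begin{aligned}
J_\pi &= \mathop{E}_{\{O_0,\ldots,O_\tau\}} \Bigl[\,\prod_{(i,t) \ni A_i(t) = 0} \Bigl(1 - \prod_{(t' \le t) \ni (C(t') = 0) \wedge (0 < A_i(t') \le \tau_M)} \bigl(1 - P_S(D(t'))\bigr)\Bigr);\ \pi\Bigr] \\
      &= \mathop{E}_{\{O_0,\ldots,O_\tau\}} \Bigl[\,\prod_{t=0}^{\tau}\ \prod_{i \ni A_i(t) = 0} \Bigl(1 - \prod_{(t' \le t) \ni (C(t') = 0) \wedge (0 < A_i(t') \le \tau_M)} \bigl(1 - P_S(D(t'))\bigr)\Bigr);\ \pi\Bigr].
\end{aligned}
\tag{14}
$$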

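We now observe that the policy $\pi$ establishes a one-to-one correspondence between the sequence
of observations $\{O_0,\ldots,O_\tau\}$ and the sequence of information vectors $\{I_0,\ldots,I_\tau\}$. Therefore, we
can rewrite Eq. (14) to get an expression in the form required to apply dynamic programming:

$$
\begin{aligned}
J_\pi &= \mathop{E}_{\{I_0,\ldots,I_\tau\}} \Bigl[\,\prod_{t=0}^{\tau}\ \prod_{i \ni A_i(t) = 0} \Bigl(1 - \prod_{(t' \le t) \ni (C(t') = 0) \wedge (0 < A_i(t') \le \tau_M)} \bigl(1 - P_S(D(t'))\bigr)\Bigr);\ \pi\Bigr] \\
      &= \mathop{E}_{\{I_0,\ldots,I_\tau\}} \Bigl[\,\prod_{t=0}^{\tau} g_t(I_t);\ \pi\Bigr]
\end{aligned}
\tag{15}
$$

where we define

$$
g_t(I_t) = \prod_{i \ni A_i(t) = 0} \Bigl(1 - \prod_{(t' \le t) \ni (C(t') = 0) \wedge (0 < A_i(t') \le \tau_M)} \bigl(1 - P_S(D(t'))\bigr)\Bigr) \quad \forall t \in \{0,\ldots,\tau\}.
\tag{16}
$$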

The form of Eq. (15) satisfies property (e). Equation (16) satisfies property (d) because $g_t(I_t)$
can be computed without reference to states for times later than $t$. In particular, $g_t(I_t)$ can be
computed solely by reference to $C(t')$, $A_i(t')$, and $D(t')$ for $t' \le t$. Furthermore, $g_t(I_t)$ is nonnegative
because it is the probability that every ASCM leaving the middle zone at time $t$ has been seduced,
and probabilities lie in the interval [0, 1]. We have therefore demonstrated all five of the properties
required to apply the dynamic programming algorithm in Appendix A. As a result, we can compute
$J_{\pi^*}$ as

$$
V_\tau(I_\tau) = g_\tau(I_\tau)
\tag{17}
$$

$$
V_t(I_t) = \max_{\mu_t(I_t)} \Bigl\{ \mathop{E}_{\{I_{t+\Delta}\}} \bigl[ V_{t+\Delta}(I_{t+\Delta}) \cdot g_t(I_t);\ \mu_t(I_t) \bigr] \Bigr\}, \quad (t \in \{0,\ldots,\tau-\Delta\})
\tag{18}
$$

$$
J_{\pi^*} = \mathop{E}_{\{I_0\}} \bigl[ V_0(I_0) \bigr].
$$

4.3.2  Constraining State Space Growth

As $t$ increases, the cardinality of $I_t$ grows rapidly. This means that the maximization in the
dynamic programming algorithm must be performed over functions with rapidly expanding domains.
However, very little of the information in $I_t$ is actually used each time to compute the reward
function. Using this insight, we will construct a related optimization problem with a state of fixed
cardinality that can be used to find the optimal policy for our original problem.

Examination of Eq. (16) reveals that the partial reward at time $t$ depends on the value of $D$
for every time an ASCM leaving the inner zone at time $t$ might have been seduced. By including
this information in our new state, we can construct a one-step Markov model without recourse to
an information vector. Equation (13) can be used to interpret the inside product in Eq. (16) as
the probability that $T_i$ is True at time $t$. We define a function $\mathcal{T}$ to compute that inside product
as follows:

$$
\mathcal{T}(i, I_t) = \prod_{(t' \le t) \ni (C(t') = 0) \wedge (0 < A_i(t') \le \tau_M)} \bigl(1 - P_S(D(t'))\bigr).
\tag{19}
$$

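For convenience, we will also define a set of symbols $(\hat{T}_1,\ldots,\hat{T}_m)$ as

$$
\hat{T}_i(t) = \mathcal{T}(i, I_t).
\tag{20}
$$

Equation (14) then becomes

$$
J_\pi = \mathop{E}_{\{O_0,\ldots,O_\tau\}} \Bigl[\,\prod_{t=0}^{\tau}\ \prod_{i \ni A_i(t) = 0} \bigl(1 - \hat{T}_i(t)\bigr);\ \pi\Bigr].
\tag{21}
$$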

It is possible to compute $\hat{T}_i(t)$ from $\hat{T}_i(t - \Delta)$, $C(t - \Delta)$, $A_i(t - \Delta)$, and $D(t - \Delta)$ as shown in
Fig. 9. To see why, recall that $\hat{T}_i(t)$ represents the probability that no seduction attempt for ASCM $i$
has been successful by time $t$. Before ASCM $i$ is detected, that probability is 1. When a seduction
attempt occurs with ASCM $i$ in the middle zone, $\hat{T}_i(t)$ is multiplied by the probability that the
seduction attempt is unsuccessful. Figure 9 is a transition diagram for $\hat{T}_i$.

Fig. 9 - Transition diagram for $\hat{T}_i$
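The incremental update described above can be sketched in a few lines. This is only an illustration of the prose description, not a transcription of Fig. 9: the function names, the argument names, and the callable `p_s` standing in for $P_S$ are assumptions, and an undetected ASCM is assumed to carry $A_i = \infty$ so that its value stays at 1.

```python
import math


def next_t_hat(t_hat, c, a_i, d, tau_m, p_s):
    """Return T-hat_i one time step later (hypothetical helper).

    t_hat : probability that no seduction attempt against ASCM i has succeeded
    c     : chaff variable C; a seduction attempt occurs when c == 0
    a_i   : the variable A_i for ASCM i (math.inf is assumed for "not detected")
    d     : the variable D passed to the seduction probability
    tau_m : middle zone bound tau_M
    p_s   : callable giving the seduction probability P_S(D) (assumed interface)
    """
    if c == 0 and 0 < a_i <= tau_m:
        # An attempt occurs with ASCM i in the middle zone:
        # multiply by the probability that this attempt fails.
        return t_hat * (1.0 - p_s(d))
    return t_hat


def partial_reward(a_values, t_hat_values):
    """Product of (1 - T-hat_i) over the ASCMs with A_i == 0; 1 if there are none."""
    return math.prod(1.0 - t_hat
                     for a_i, t_hat in zip(a_values, t_hat_values)
                     if a_i == 0)
```

For example, with a constant seduction probability of 0.5, two attempts in the middle zone reduce the value from 1 to 0.25, so an ASCM arriving afterwards contributes a factor of 0.75 to the partial reward.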

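We are now ready to formally define our new model. We begin with the counterparts to $I_t$,
$\mu_t(\cdot)$, and $\pi$:

$$
\hat{O}_t = O_t \cup \Bigl(\bigcup_i \hat{T}_i(t)\Bigr)
$$

$$
\hat{\mu}_t : \hat{O}_t \mapsto L(t), \quad (t \in \{0,\ldots,\tau-\Delta\})
$$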

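We can replace the $\sigma$-algebra defined by $O_t$ in Eq. (21) with the finer $\sigma$-algebra defined by $\hat{O}_t$
to get an expression for the reward function:

$$
J_\pi = \mathop{E}_{\{\hat{O}_0,\ldots,\hat{O}_\tau\}} \Bigl[\,\prod_{t=0}^{\tau}\ \prod_{i \ni A_i(t) = 0} \bigl(1 - \hat{T}_i(t)\bigr);\ \pi\Bigr].
\tag{22}
$$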

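Before we can apply the dynamic programming algorithm we must replace $\pi$ with $\hat{\pi}$ to satisfy
property (b). Accordingly, we define

$$
J_{\hat{\pi}} = \mathop{E}_{\{\hat{O}_0,\ldots,\hat{O}_\tau\}} \Bigl[\,\prod_{t=0}^{\tau}\ \prod_{i \ni A_i(t) = 0} \bigl(1 - \hat{T}_i(t)\bigr);\ \hat{\pi}\Bigr]
\tag{23}
$$

and seek to find

$$
J_{\hat{\pi}^*} = \max_{\hat{\pi}} \{J_{\hat{\pi}}\}.
$$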


We claim that maximizing $J_{\hat{\pi}}$ is equivalent to maximizing $J_\pi$ in the sense that for every optimal
policy $\pi^*$ there exists an optimal policy $\hat{\pi}^*$ such that $J_{\pi^*} = J_{\hat{\pi}^*}$. We prove this claim in Appendix
B by showing that $\mu_t(I_t)$ can be chosen to be the same for every $I_t$ that corresponds to the same $\hat{O}_t$
without changing the optimal reward. This means that $\hat{\mu}_t(\hat{O}_t)$, with $\hat{O}_t$ computed from $I_t$ by means
of $\mathcal{T}(i, I_t)$, is an optimal control function for the original optimization problem.

Now we are ready to show that the same five properties hold for the new optimization problem.
We will treat $\hat{O}_t$ as the state. Property A(a) is satisfied because the periodic launch of chaff limits
us to a small number of seduction attempts for each ASCM. Since $D$ takes values from a finite set
and the number of products of functions of $D$ we are performing is finite, $\hat{T}_i$ takes values from a
finite set. Therefore, $\hat{O}_t$ is a finite set.

Property A(b) is satisfied because the domain of the control function is the state. Furthermore,
the control policy will be useful to a defender. Since $\hat{O}_0$ is known, and each $\hat{T}_i(t)$ is computed
incrementally based only on information contained in $\hat{O}_{t-\Delta}$, $\hat{T}_i(t)$ becomes known by time $t$. Because
$O_t$ is also known by time $t$, we conclude that $\hat{O}_t$ becomes known by time $t$. So the domain of the
control function is known in real time.

The transition diagram in Fig. 9 is deterministic, and Fig. 8 can be used to determine the next
state function for $\hat{O}_t$ without knowledge of the outcome of stochastic transitions of $T_i$. So property
A(c) is satisfied as well.

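To apply the dynamic programming algorithm we define

$$
\hat{g}_t(\hat{O}_t) = \prod_{i \ni A_i(t) = 0} \bigl(1 - \hat{T}_i(t)\bigr) \quad \forall t \in \{0,\ldots,\tau\}.
\tag{25}
$$

This allows us to rewrite the reward function in the form required by property (e):

$$
J_{\hat{\pi}} = \mathop{E}_{\{\hat{O}_0,\ldots,\hat{O}_\tau\}} \Bigl[\,\prod_{t=0}^{\tau} \hat{g}_t(\hat{O}_t);\ \hat{\pi}\Bigr].
$$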


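This quantity can be computed based only on $\hat{O}_t$, and it is nonnegative because it is a probability.
So properties (d) and (e) are satisfied. Since all five properties in Appendix A are satisfied we may
write:

$$
\hat{V}_\tau(\hat{O}_\tau) = \hat{g}_\tau(\hat{O}_\tau)
\tag{26}
$$

$$
\hat{V}_t(\hat{O}_t) = \max_{\hat{\mu}_t(\hat{O}_t)} \Bigl\{ \mathop{E}_{\{\hat{O}_{t+\Delta}\}} \bigl[ \hat{V}_{t+\Delta}(\hat{O}_{t+\Delta}) \cdot \hat{g}_t(\hat{O}_t);\ \hat{\mu}_t(\hat{O}_t) \bigr] \Bigr\}, \quad (t \in \{0,\ldots,\tau-\Delta\})
\tag{27}
$$

$$
J_{\hat{\pi}^*} = \mathop{E}_{\{\hat{O}_0\}} \bigl[ \hat{V}_0(\hat{O}_0) \bigr].
\tag{28}
$$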


5  DYNAMIC  PROGRAMMING  IMPLEMENTATION 

Three well-known ways have been developed in which dynamic programming equations can be
used to find optimal controls. The equations can be used to develop analytic proofs of optimality,
or they can be used in one of two numerical techniques: policy iteration or value iteration. In the
following paragraphs, each of these approaches is described. Because it is the most straightforward
of the three, we then consider value iteration in detail. Applicability of the other two methods to
this problem is a topic for future research.

Because we have shown that any policy that satisfies Eqs. (26), (27), and (28) is optimal, this set
of  equations  can  be  used  to  test  candidate  policies  for  optimality.  One  way  to  apply  this  observation 
is  to  develop  a  policy  through  some  independent  technique  and  then  to  attempt  to  construct  an 
analytic  proof  that  the  control  applied  for  every  state  at  every  time  is  a  maximizing  control  in 
Eq.  (27).  While  such  an  approach  is  probably  only  feasible  for  relatively  simple  policies,  it  offers 
the  possibility  of  avoiding  a  computational  implementation  altogether. 

Howard has developed a computationally efficient technique called policy iteration that takes
advantage of the fact that the dynamic programming equation can be used to define a contraction
mapping in policy space [2]. His approach was developed for models with a long time horizon and
a reward function that is formed as the sum of a set of stage rewards. Beginning with an arbitrary
control function and an arbitrary assignment of rewards for each state, the policy iteration algorithm
iteratively applies Eq. (27) to develop a more nearly optimum control function and the associated
rewards. Although this "policy iteration" must continue until a fixed point is reached, Howard
reports that the algorithm often converges after just a few iterations.

Howard's policy iteration algorithm computes each succeeding control function by solving a set
of linear equations. For reward functions formed as a product of a set of stage rewards, such as
we have in our model, it is not clear that a practical analogue to this approach can be developed.
Furthermore, we would expect that reformulation of our model to incorporate a longer time horizon
would significantly alter the optimal policy choice. Nonetheless, it may prove worthwhile to
investigate further the application of policy iteration to this class of reward functions given the
problems that we describe below with implementation of the value iteration technique.

Equations (26), (27), and (28) can also be used directly to calculate the optimal control for
each state at each time. Conceptually, this approach is quite simple. First, the results of Eq. (26)
are computed and stored for each state. Equation (27) is then applied to each state, iterating this
operation backwards in time until time 0 is reached. At each time the optimizing control for each
state is stored in an array. Finally, the given distribution on the initial state is used to find the
expected reward by applying Eq. (28). This technique is known as value iteration. In the remainder
of this report we describe the implementation details and the resulting computational complexity
of value iteration dynamic programming.

5.1  State  Coding 

Because each state must be considered at each time, reducing the cardinality of the state space
proportionally reduces both time and space requirements. For this reason, redundant information
should be removed from the state before coding the algorithm. Redundant information is
information that would not affect the value of the reward function if it were deleted.

Two types of redundant information exist in our model. The most obvious type is represented
by unreachable states. Certain state variable combinations will never be reached, regardless of
the control policy or the outcome of random events. For example, in the engagement model it is
not possible for a SAM to be in flight ($S > 0$) when the SAM inventory is at its maximum value
($N = N_0$) since launching a SAM decrements $N$. Other state variable combinations can occur at
some times, but not at others. For example, since only one SAM can be launched at each time
step, $N$ cannot reach 0 until $N_0$ time steps have elapsed. So the set of unreachable states varies
with time.

A more subtle type of redundant information is represented by states that can be excluded
without changing the value of the reward function, although the policy that achieves that reward
may change. We call these states redundant states, because, although they are initially
reachable, the optimization problem can be reformulated in a way that makes them unreachable.
Often, more than one policy will result in the same reward. We define policies that result in the
same reward to be equivalent and use this relation to partition the policy space into equivalence
classes. By introducing additional constraints on the control function, we may reduce the size of
some of these equivalence classes while increasing the number of unreachable states. As long as one
policy remains in the optimizing equivalence class, the dynamic programming algorithm presented
above will find it.

A simple example may help to clarify this somewhat abstract concept. Consider an instance of
the engagement model with one ASCM that is known to arrive at time 0, one SAM (i.e., $N_0 = 1$),
and no chaff (i.e., $C_0 > \tau$). Since $P_K$ is constant, the one SAM could be launched at any time that
would allow it to reach the ASCM while the ASCM is in the middle zone. Every such policy would
result in the same reward in the model we have presented. So every policy with exactly one SAM
launch  between  time  0  and  time  tm  -  ^  is  in  the  optimizing  equivalence  class.  Introducing  the 
constraint  (t  <  tm  -  p£)  =>  L  =  Hold  reduces  the  optimizing  equivalence  class  to  a  single  “Don’t 
shoot until you see the whites of their eyes” policy and makes every state with $S > 0$ unreachable
between  time  0  and  time  tm  - 

Because we have not developed a general procedure to recognize redundant states and craft
control constraints that make them unreachable while preserving at least one policy in the
optimizing equivalence class, an ad hoc approach has been adopted here. As our example shows, the
choice of constraints and the resulting set of unreachable states depends on the values of the model
parameters (consider how different the result would have been with $N_0 = 2$). For this reason, we
have chosen not to pursue the issue of redundant states further.

Unreachable states are more easily recognized and eliminated because no new control constraints
are  required.  In  the  engagement  model,  7}  €  Ot  until  the  first  seduction  attempt  occurs  with  ASCM 
$i$ in the middle zone. But knowledge of $T_i$ is never required to find the next state, and it is never
used to compute $\hat{g}_t$. So we can calculate $J_{\hat{\pi}^*}$ by using Eqs. (26), (27), and (28) without ever knowing
$T_i$. This means that simply eliminating $T_i$ from the state (or equivalently forcing it to a fixed
value) will change neither the policy nor the value of the reward function.

Similarly, $C$ can be eliminated from the state because it depends in a known way on time,
regardless of the policy employed. It is included in Fig. 8 simply as a notational convenience. Once
$C_0$, $C_{\max}$, and $t$ are specified, we can compute $C$ as:

$$
(t < C_0) \Rightarrow C = C_0 - t
$$

$$
(t \ge C_0) \Rightarrow C = (C_0 - t) \bmod C_{\max}.
$$
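As a sketch, the two cases translate directly into a small function. The function name is hypothetical, time is assumed to be measured in integer multiples of the time step, and the modulus is taken in the mathematical sense (a result in the range 0 to $C_{\max} - 1$), which is what Python's `%` operator returns for a positive modulus.

```python
def chaff_variable(t, c0, c_max):
    """Compute C at time t from C_0 and C_max (integer time steps assumed)."""
    if t < c0:
        return c0 - t
    # Python's % yields a nonnegative result for a positive modulus,
    # so (c0 - t) % c_max wraps around with period c_max.
    return (c0 - t) % c_max
```

With `c0 = 3` and `c_max = 5`, the value counts down 3, 2, 1, reaches 0 at `t = 3`, and returns to 0 every five steps thereafter, which matches the periodic chaff launch described in the text.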

The remaining state variables ($N$, $S$, $D$, $A_i$, and $\hat{T}_i$) are all influenced by the control policy and
are used in the computation of the reward function. The variables $A_i$ and $\hat{T}_i$ are used directly to
find $\hat{g}_t(\hat{O}_t)$, while $D$ is used to find $\hat{T}_i$, $S$ is used to find $D$ and to find the distribution on $A_i$, and
$N$ is used to constrain the control function. It is still possible to find combinations of these state
variables that cannot occur, however, as our first example in this section demonstrates.

Since the set of reachable states changes over time, a thorough search for unreachable states
would require extensive analysis. The most general approach to eliminating unreachable states
would be to enumerate the reachable states and map each reachable state to an index in this
enumeration. As was the case for redundant states, however, unreachable states can be difficult to
recognize in advance. We cannot use the dynamic programming algorithm to discover which states
are unreachable because the algorithm proceeds backwards in time. Once we discovered that a state
was unreachable, the work we sought to avoid would already have been done! A tree walk would
discover which states are reachable, but it would also find the optimal policy (at great expense),
obviating the need for dynamic programming. So again we are reduced to an ad hoc approach to
eliminating unreachable states.

Programmers must also balance the potential savings obtained from encoding the state against
the time and space required by the coding and decoding operations. Unreachable state variable
combinations that have simple specifications in terms of existing state variables are easily
eliminated. Those that require complex specifications or computationally expensive transformations
on the state space may not be worth eliminating even if they could be specified. Using this
observation to guide our search, we have found no combinations beyond the example given above
(($N = N_0$) $\Rightarrow$ $S = 0$) that apply for all possible model parameters.
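One inexpensive coding, shown here only as an illustration and not as the coding used in this report, is a mixed-radix index over the value indices of the state variables. The sketch below covers only $N$, $S$, and $D$, uses the value counts quoted in Section 5.4 for the Table 2 parameters, and does not attempt to skip unreachable combinations; the function names are hypothetical.

```python
# Numbers of useful values quoted in Section 5.4 for the Table 2 parameters.
N_VALUES = 6    # values of N
S_VALUES = 17   # values of S
D_VALUES = 9    # values of D


def encode(n_idx, s_idx, d_idx):
    """Map value indices (each starting at 0) to a single array index."""
    return (n_idx * S_VALUES + s_idx) * D_VALUES + d_idx


def decode(code):
    """Invert encode(): recover (n_idx, s_idx, d_idx) from an array index."""
    rest, d_idx = divmod(code, D_VALUES)
    n_idx, s_idx = divmod(rest, S_VALUES)
    return n_idx, s_idx, d_idx
```

Both directions cost a few integer operations, so the coding overhead is small compared with the work done per state. Removing unreachable combinations would require a less regular mapping, which is the trade-off discussed above.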

5.2  Data  Structures 

The  dynamic  programming  algorithm  iteratively  uses  the  expected  partial  reward  for  every 
state  to  find  the  expected  partial  reward  for  each  state  one  time  step  earlier.  Because  the  partial 
rewards  represent  an  expected  probability  of  survival,  they  are  real  numbers  between  0  and  1.  So 
we  will  need  at  least  two  floating  point  arrays.  No  more  than  two  arrays  are  needed  because  only 
the  present  partial  rewards  are  required  to  find  the  partial  reward  for  each  state  one  time  step 
earlier.  We  simply  alternate  between  the  two  arrays  to  represent  the  present  time  step  as  we  iterate 
backwards  in  time. 

If our goal were simply to discover the optimal expected probability of survival, these two arrays
would be sufficient. The dynamic programming algorithm will, however, also identify a policy that
results in this optimal reward. Storing that policy requires an array indexed by both time and
state. Because the launch control is binary valued, a Boolean array will suffice.

5.3  Algorithm Implementation

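While Eqs. (26), (27), and (28) can be used to compute the optimal policy, two simplifications
can reduce the implementation effort. If we define $\hat{V}_{\tau+\Delta}(\hat{O}_{\tau+\Delta}) = 1\ \forall \hat{O}_{\tau+\Delta}$, Eq. (26) becomes
a special case of Eq. (27). Unnecessary computations are easily avoided by choosing an arrival
distribution with no arrivals at time $\tau$ and adopting a constraint that prevents SAM launches at
time $\tau$. Since $\hat{O}_0$ is known, Eq. (28) can be replaced with $J_{\hat{\pi}^*} = \hat{V}_0(\hat{O}_0)$. So all we must compute
is:

$$
\hat{V}_t(\hat{O}_t) = \max_{\hat{\mu}_t(\hat{O}_t)} \Bigl\{ \mathop{E}_{\{\hat{O}_{t+\Delta}\}} \bigl[ \hat{V}_{t+\Delta}(\hat{O}_{t+\Delta}) \cdot \hat{g}_t(\hat{O}_t);\ \hat{\mu}_t(\hat{O}_t) \bigr] \Bigr\}, \quad (t \in \{0,\ldots,\tau\}).
\tag{29}
$$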

To  compute  this  quantity,  we  must  find  the  distribution  on  the  next  state  given  the  present 
state  and  the  control.  At  each  time,  Eq.  (3)  can  be  used  to  compute  the  probability  of  detecting 
any  particular  number  of  ASCMs  for  each  state.  Figure  8  can  also  be  used  for  each  state  to  find  the 
probability  that  a  SAM  will  kill  the  closest  ASCM.  Because  we  have  modeled  these  as  independent 
events,  we  can  then  compute  the  probability  of  each  possible  random  outcome.  For  each  possible 
control  the  other  state  variables  make  deterministic  transitions  that  are  specified  in  Figs.  8  and  9. 

Only control values that satisfy the constraints on the launch control need be considered. We do
not allow another SAM to be launched if a SAM is already in flight ($S > 0$), the SAM inventory is
exhausted ($N = 0$), or the final time has been reached ($t = \tau$). If one of these conditions holds, we
only consider $L(t) = \text{False}$. Otherwise, we choose the control that maximizes Eq. (29). Specifically,
we compute:

$$
\mathrm{V1}[\text{state1}] = \max_{\mathrm{L}[\text{state1}]} \Bigl\{ \sum_{\text{state2}} \Pr\{\text{transition from state1 to state2};\ \mathrm{L}[\text{state1}]\}
\cdot \mathrm{V2}[\text{state2}] \cdot \prod_{i \ni A_i = 0 \text{ in state1}} \bigl(1 - \hat{T}_i \text{ in state1}\bigr) \Bigr\}
\tag{30}
$$

where V1 is the array of partial rewards at the prior time, V2 is the array of partial rewards at the
present time, and L is the slice of the policy array for the prior time.
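The launch constraints stated before Eq. (30) amount to a small predicate on the state and the time. The following sketch lists the control values to be searched; the function name, the argument names, and the use of negative infinity to encode "no SAM in flight" are assumptions made for illustration.

```python
import math

NO_SAM = -math.inf  # assumed encoding of S when no SAM is in flight


def allowed_launch_controls(s, n, t, tau):
    """Return the values of L(t) that Eq. (30) should maximize over.

    s   : the variable S (s > 0 means a SAM is already in flight)
    n   : the remaining SAM inventory N
    t   : the current time
    tau : the final time
    """
    if s > 0 or n == 0 or t == tau:
        return (False,)          # launching is not permitted
    return (False, True)         # both Hold and Launch are candidates
```

Returning a tuple of candidate controls lets the maximization loop iterate over it without special cases for constrained states.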

We are now ready to tie up the loose ends and present the algorithm. The time variable $t$ is
initially set to $\tau$, and every value in V2 is set to 1. V1[state1] is then calculated by using Eq. (30)
for each state1 at time $t$ and the maximizing L[state1] (either True or False) is stored in the policy
array for state1 at time $t$. Once V1[state1] has been computed for every state, V2 is replaced by
V1, $t$ is decremented by $\Delta$, and the process repeats using Eq. (30). This time $\hat{V}_{t-\Delta}(\cdot)$ and $\hat{\mu}_{t-\Delta}(\cdot)$
are computed. The iteration continues until the pass for $t = 0$ has been completed. The value of
$J_{\hat{\pi}^*}$ can then be read from V1[state1] for the value of state1 that corresponds to the known initial
value of $\hat{O}_0$. The policy array will contain a specification for a SAM launch policy that would result
in this expected probability of survival.
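The procedure above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this report: states are assumed to be integer codes, time is counted in integer steps, and the three callbacks (`transitions`, `partial_reward`, `may_launch`) are hypothetical interfaces. The two-state toy model at the end is likewise invented and is not the engagement model.

```python
def value_iteration(num_states, num_steps, transitions, partial_reward, may_launch):
    """Backward pass of Eq. (30) with two value arrays and a Boolean policy array.

    transitions(state, launch, step) -> list of (next_state, probability)
    partial_reward(state, step)      -> nonnegative partial reward for the state
    may_launch(state, step)          -> True if L = True satisfies the constraints
    """
    v2 = [1.0] * num_states  # values one step later; all ones beyond the final time
    policy = [[False] * num_states for _ in range(num_steps + 1)]
    for step in range(num_steps, -1, -1):
        v1 = [0.0] * num_states
        for state1 in range(num_states):
            controls = (False, True) if may_launch(state1, step) else (False,)
            best_value, best_launch = -1.0, False
            for launch in controls:
                expected = sum(prob * v2[state2]
                               for state2, prob in transitions(state1, launch, step))
                if expected > best_value:
                    best_value, best_launch = expected, launch
            # The partial reward is nonnegative, so it can multiply the maximum.
            v1[state1] = partial_reward(state1, step) * best_value
            policy[step][state1] = best_launch
        v2 = v1  # the array just computed becomes the "present" array
    return v2, policy


# Hypothetical toy model: state 0 = threat alive, state 1 = threat killed.
NUM_STEPS = 2


def toy_transitions(state, launch, step):
    if state == 0 and launch:
        return [(1, 0.6), (0, 0.4)]   # assumed kill probability of 0.6
    return [(state, 1.0)]


def toy_partial_reward(state, step):
    # Assumed: a live threat at the final step leaves a survival factor of 0.2.
    return 0.2 if (state == 0 and step == NUM_STEPS) else 1.0


def toy_may_launch(state, step):
    return step < NUM_STEPS           # no launch at the final time


values, policy = value_iteration(2, NUM_STEPS, toy_transitions,
                                 toy_partial_reward, toy_may_launch)
```

In this toy model the optimal policy launches at both steps 0 and 1 while the threat is alive, and the value for state 0 at time 0 is 0.6 + 0.4 (0.6 + 0.4 x 0.2) = 0.872. Ties are broken in favor of not launching because the comparison is strict.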

5.4  Computational  Complexity 

The  dynamic  programming  algorithm  requires  that  for  each  control  all  of  the  states  that  can 
be  reached  from  every  state  at  every  time  be  considered.  Therefore,  the  number  of  operations  that 
must  be  performed  is  directly  proportional  to  the  number  of  controls  times  the  fanout  from  each 
state  times  the  number  of  states  times  the  number  of  time  steps. 

The fanout is the number of states that can be reached from a given state in one time step for
a specified control. Since at most $m + 1$ ASCMs can arrive simultaneously and states that include
an intercept (i.e., $S = 0$) could lead to one of two different states for each possible number of
arrivals, the maximum fanout is $2(m + 1)$. Most states do not, however, include an intercept and
most restrict the maximum number of arrivals to fewer than $m$ ASCMs because some ASCMs have
already  been  detected.  For  this  reason,  is  probably  a  closer  estimate  of  the  average  fanout. 

For  the  parameters  in  Table  2,  an  average  fanout  of  approximately  2  would  be  expected. 

The  number  of  states  turns  out  to  be  very  large: 

$A_i$  There are $t_m/\Delta + 3$ useful values for each $A_i$. Those values are $\{-\infty, 0, \Delta, \ldots, t_m, \infty\}$. Other
positive numbers do not occur. All negative numbers can be mapped to $-\infty$ because doing
so preserves the effect of $A_i$ on each branch of every transition diagram in Figs. 8 and
9. Since there are m ASCMs, there are $(t_m/\Delta + 3)^m$ possible values for $\{A_1, \ldots, A_m\}$. The
parameters in Table 2 result in 51 possible values for each $A_i$, or $1.3 \times 10^5$ possible values for
$\{A_1, \ldots, A_m\}$.

D  There are $D_{max}/\Delta + 1$ useful values for D. Those values are $\{0, \Delta, \ldots, D_{max}\}$. Negative numbers
do not occur. Other positive numbers can be mapped to $D_{max}$ because the definition of
$D_{max}$ ensures that $(D > D_{max}) \Rightarrow P_s(D) = P_s(D_{max})$. The parameters in Table 2 result in
9 possible values for D.

N  There are $N_0 + 1$ possible values for N. They are $\{0, 1, \ldots, N_0\}$. The parameters in Table 2
result in 6 possible values for N.

S  There  are  Isj^+ijJ  +  2  useful  values  for  S.  These  values  are  {-oo,  0,  A, ... ,  Figure 

8  shows  that  the  largest,  possible  value  for  S  is  t](t)  —  t.  Substitution  of  the  maximum  possible 
value  for  Ai  in  Eq.  (4)  yields  for  this  value.  All  negative  numbers  can  be  mapped 

to $-\infty$ because doing so preserves the effect of S on each branch of every transition diagram
in Figs. 8 and 9. The parameters in Table 2 result in 17 possible values for S.

Ti  Since  there  are  ^  +  1  time  steps  and  +  1  time  steps  elapse  between  each  seduction,  there 
can  be  at  most  [ seduction  attempts.  The  transition  diagram  in  Fig.  9  shows  that 
$T_i$ will be multiplied by a function of D (which could assume one of $D_{max}/\Delta + 1$ values) at
each seduction attempt. Since an unsuccessful seduction attempt (D = 0) results in the same
value as no seduction attempt, D = 0 can be used to represent either situation. Because
multiplication is commutative and associative, there are as many possible values of $T_i$ as
there are choices (with replacement) of values of D for each seduction attempt. Since there
are m ASCMs, there are at most this number raised to the mth power possible values for
$\{T_1, \ldots, T_m\}$.

This  value  can  be  computed  as: 


The parameters in Table 2 result in at most 3 possible seductions and 9 possible values for D
at each seduction attempt, resulting in 129 possible values for $T_i$, or $2.1 \times 10^6$ possible values
for $\{T_1, \ldots, T_m\}$.

Multiplying the values for the parameters in Table 2 together, we find there are over $2.6 \times 10^{14}$
possible states. Since $(N = N_0) \Rightarrow (S = 0)$, a slight reduction to approximately $2.2 \times 10^{14}$ states
can be achieved by coding N and S together.

Because  the  launch  control  is  binary,  the  maximum  number  of  possible  controls  is  two.  The 
constraint  that  precludes  a  SAM  launch  when  a  SAM  is  already  in  flight  makes  the  average  number 
of controls much closer to one, though, because S > 0 in most of the states. Finally, there are
$\tau/\Delta + 1$ possible time steps. For the parameters in Table 2, there are 148 possible time steps.

Combining these results, the dynamic programming algorithm requires retrieval of approximately
$6.5 \times 10^{16}$ floating point numbers from memory and requires us to perform a proportional
number of arithmetic and logical operations. Even with a 10 gigabyte per second memory bandwidth
it would take over two months to simply perform the required memory accesses. Each of
the two floating point partial reward arrays contains approximately 220 trillion values; together,
they require about 1800 terabytes of random access memory when stored with single precision.
The policy array consists of approximately 220 trillion bits for each of the 148 time steps, and
it requires over 4000 terabytes of offline storage during the computation and an equal amount of
random access memory during policy employment.
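
The figures in this section can be reproduced with a few lines of arithmetic. The sketch below uses only the counts quoted above; the value m = 3 is inferred from those counts (51 cubed is about 1.3 x 10^5) rather than read from Table 2, and the time estimate assumes the bandwidth is spent at 10^10 retrieved values per second, which is the reading that reproduces "over two months".

```python
# Per-variable value counts quoted in Section 5.4 for the Table 2 parameters.
m = 3                     # number of ASCMs (inferred, see lead-in)
values_per_A = 51
values_D = 9
values_N = 6
values_S = 17
values_per_T = 129
time_steps = 148

# Product of the per-variable counts: the raw number of states.
states = values_per_A**m * values_D * values_N * values_S * values_per_T**m

# The text quotes about 2.2e14 states after coding N and S together.
coded_states = 2.2e14
avg_fanout = 2            # average fanout quoted for Table 2
avg_controls = 1          # "much closer to one"

retrievals = coded_states * avg_controls * avg_fanout * time_steps
days = retrievals / 1e10 / 86_400          # assumed 1e10 values per second

reward_tb = 2 * coded_states * 4 / 1e12     # two single-precision arrays
policy_tb = coded_states * time_steps / 8 / 1e12  # one bit per state and step

print(f"states      ~ {states:.2e}")
print(f"retrievals  ~ {retrievals:.2e}")
print(f"access time ~ {days:.0f} days")
print(f"reward arrays ~ {reward_tb:.0f} TB, policy array ~ {policy_tb:.0f} TB")
```

If each retrieved value is instead counted as four bytes against a 10 gigabyte per second budget, the access time grows by a further factor of four.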

The strength of value iteration dynamic programming is that it provides an exact global solution
for  combinatorial  optimization  problems  without  resorting  to  exhaustive  search.  For  the  model  we 
have  developed  above,  the  computational  complexity  of  using  dynamic  programming  in  this  way  is 
obviously  well  beyond  the  capability  of  existing  computer  systems.  In  the  following  discussion  we 
will  describe  two  alternative  approaches  for  resolving  this  dilemma.  First  we  consider  the  potential 
for  restricting  our  model  in  ways  that  will  result  in  improved  implementation  efficiency.  Since  it 
remains  an  open  question  whether  the  approaches  we  describe  below  retain  sufficient  fidelity  to 
allow  application  of  value  iteration  dynamic  programming  to  practical  problems,  we  conclude  with 
a  brief  discussion  of  potential  for  developing  heuristic  optimization  techniques  for  these  problems. 

5.5  Reducing  Computational  Complexity 

The  four  factors  contributing  to  the  complexity  of  the  dynamic  programming  algorithm  are 
the  number  of  states,  the  number  of  available  controls  in  each  state,  the  fanout  from  each  state, 
and  the  number  of  time  steps.  There  is  little  point  to  further  constraining  the  number  of  available 
controls  in  each  state  because  only  one  possible  control  exists  in  most  of  the  states  and  no  more 
than  two  controls  are  possible  in  any  state  in  our  model.  However,  we  can  gain  significantly  by 
reducing the number of time steps. The number of time steps can be reduced by increasing $\Delta$
or by introducing an assumption such as clustered arrivals that would allow $\tau$ to be decreased.
Furthermore, increasing $\Delta$ would significantly reduce the number of states, and clustered arrivals
would also reduce fanout. Thus we begin by considering changes that reduce the number of
time  steps  and  then  discuss  other  assumptions  that  could  be  introduced  to  reduce  the  state  space 
still  further. 

5.5.1  Time  Step  Reduction 

The value of $\Delta$ in Table 2 results in a separation between values of D that is appropriate to the
accuracy with which data for $P_s(D)$ was collected. Increasing $\Delta$ results in coarser coding for D,
leading to a less accurate estimate of seduction effectiveness. This decreased accuracy could lead
to an overdependence or underdependence on the chaff, depending on the type of error introduced.
Thus, a significant increase in $\Delta$ could make the results of the optimization less useful.

In addition to reducing the number of time steps, increasing $\Delta$ also reduces the number of
possible values for every state variable except N. For the parameters in Table 2, the number of
possible states decreases with the eighth power of the factor by which $\Delta$ is increased. Doubling $\Delta$
would reduce the number of possible states by a factor of 256. Coupled with the time step reduction,
the computational complexity is reduced by the ninth power, a factor of 512 for doubling $\Delta$. This
extreme sensitivity to $\Delta$ places great weight on the choice of the maximum practical value for this
parameter.
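
The eighth and ninth powers can be traced with a short worked equation. It is a sketch under two assumptions: that m = 3 ASCMs (inferred from the value counts quoted in Section 5.4), and that the number of values of each affected state variable scales inversely with the time quantum. The symbol k, the factor by which the quantum is increased, is introduced here for illustration.

```latex
\underbrace{k^{m}}_{A_1,\ldots,A_m}\cdot
\underbrace{k}_{D}\cdot
\underbrace{k}_{S}\cdot
\underbrace{k^{m}}_{T_1,\ldots,T_m}
 = k^{2m+2} = k^{8} \quad (m = 3),
\qquad
k^{8}\cdot\underbrace{k}_{\text{time steps}} = k^{9},
\qquad
k = 2:\; 2^{8} = 256,\; 2^{9} = 512 .
```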

The potential improvement resulting from reducing $\tau$ is less dramatic. We consider an engagement
to begin when the first ASCM is detected. The value of $\tau$ in Table 2 is sufficient to find the
optimal policy regardless of the subsequent ASCM arrival pattern, subject only to the requirement
that every ASCM shares the middle zone with at least one other ASCM for at least one time
instant. It was found by considering the case in which each subsequent ASCM is detected just as
the prior ASCM reaches the inner zone. We have not considered the case of ASCMs that never
share the middle zone with another ASCM because that case is easily handled by decomposing the
multiple ASCM engagement into a sequence of single ASCM engagements. The optimal policy for
the multiple ASCM engagement could then be found by sequentially finding the optimal policy for
each single ASCM engagement.

Restricting all arrivals to the first few seconds of the engagement would reduce $\tau$ by a factor
between 2 and 3. This is a reasonable restriction because clustered arrivals are known to be an
effective attack technique. Restricting ASCM arrival times in this way would also reduce the
average fanout from around 2 to a value close to 1. Taken together, this approach offers a factor
of 5 improvement in speed and a factor of 2.5 improvement in the number of states.

5.5.2  State  Space  Reduction 

Together, the two approaches to time step reduction could yield a performance improvement of
about three orders of magnitude if, for example, analysis showed that $\Delta$ could safely be doubled.
Even with that speedup, additional improvement would be required before a practical implementation
could be developed. One approach is to simply eliminate the inventory constraint. If enough
SAMs were available to the defender to assure their availability throughout the engagement, the
choice of controls would not depend on the inventory. Thus the state variable N could be deleted
and the SAM launch constraint simplified by eliminating the N > 0 requirement. For the parameters
in Table 2, this would reduce the number of states by a factor of 6.

A bolder approach would be to arbitrarily reduce the precision with which a state variable is
recorded. While this approach cannot be applied to counting processes such as N, S and $A_i$, it
offers great promise when applied to $T_i$. Initial experiments with the parameters shown in Table 2
(with a restriction to clustered arrivals to speed execution) show that approximately 10 evenly
spaced values are sufficient to compute a value for the optimal reward that is accurate to within
0.05 of the value found when exact values were used. Such coding results in 1,000 possible values
for $\{T_1, \ldots, T_m\}$, a reduction by a factor of over 2,000 compared to the number of possible values
for exact coding. Unfortunately, the policy found in this way could differ significantly from the
optimal policy found by using exact coding. Additional research is required to understand how
quantizing $T_i$ affects the optimal policy and to find the best quantization algorithm.

In summary, eliminating the inventory constraint and clustering arrivals would reduce the
computational complexity by one order of magnitude while retaining the ability to accurately model a
wide range of realistic engagements. By taking these steps and doubling $\Delta$ and quantizing $T_i$ to 10
values, we could achieve a seven order of magnitude improvement over a model with the parameters
shown in Table 2. This improvement comes at the expense of some loss of accuracy that remains
to be quantified, but it allows a solution to be calculated for a problem with parameters similar to
those in Table 2 without placing impractical demands on computer resources.
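
The orders of magnitude in this summary can be checked by multiplying the individual factors quoted in Sections 5.5.1 and 5.5.2. The sketch below simply treats those factors as independent, which is an assumption made here for illustration; the text does not analyze how the reductions interact (for example, a coarser time quantum and quantization both act on the seduction state variables).

```python
import math

# Factors quoted in Sections 5.5.1 and 5.5.2 for the Table 2 parameters.
clustered_arrivals = 5            # speed improvement from clustering arrivals
no_inventory = 6                  # dropping the inventory variable N
doubled_quantum = 2 ** 9          # ninth-power sensitivity, quantum doubled
exact_T_values = 129 ** 3         # exact coding of the seduction variables
quantized_T_values = 10 ** 3      # 10 evenly spaced values per variable
quantization = exact_T_values / quantized_T_values

# Modest package: no inventory constraint plus clustered arrivals.
modest = clustered_arrivals * no_inventory

# Aggressive package: additionally double the quantum and quantize.
aggressive = modest * doubled_quantum * quantization

print(f"modest:     {modest:.0f}x  (~10^{math.log10(modest):.1f})")
print(f"aggressive: {aggressive:.2e}x  (~10^{math.log10(aggressive):.1f})")
```

Under that independence assumption the modest package lands between one and two orders of magnitude and the aggressive package between seven and eight.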

6  FUTURE  RESEARCH 

Practical  ship  defense  systems  will  require  models  that  incorporate  additional  dynamics  and  a 
wider  range  of  parameters.  In  practice,  ASCM  seeker  parameters  may  not  be  uniform,  and  may  not 
be  known  a  priori.  Other  defensive  systems  must  be  integrated  with  the  two  considered  here.  And 
protracted  engagements  consisting  of  several  waves  of  ASCMs  may  have  to  be  addressed.  These 
requirements  would  yield  larger  state  spaces  and  greater  fanout,  and  may  require  many  more  time 
steps. Even the improvements of the type just described offer no practical hope for dealing with
this sort of computational complexity.

The simplifying assumptions in Section 5.5 alone will not be adequate for models of this
complexity. The principal advantage of the dynamic programming algorithm described in this report
is that it produces an optimal solution without considering the alternatives available in each state
at each time more than once. Its principal limitation is that we must perform the computation for
every state, regardless of whether the state is reachable. This limitation arises because we work
backwards in time in the dynamic programming algorithm, and hence we are unable to recognize
unreachable states during the computation. A forward tree search for the optimal policy avoids
evaluating unreachable states at the cost of greater time or space complexity. To apply dynamic
programming to these more challenging models would require the investigation of other uses of
the dynamic programming equations such as policy iteration and analytic proofs of optimality. If
these techniques do not prove to be feasible, it will be necessary to consider heuristic optimization
techniques.

A  number  of  heuristic  techniques  that  have  been  applied  to  similar  problems  appear  to  offer  some 
promise  for  application  to  the  ship  defense  problem.  SAMUEL  is  a  computer  program  developed 
by  the  Navy  Center  for  Artificial  Intelligence  Research  that  uses  a  genetic  algorithm  to  develop  a 
specification  for  a  near-optimal  policy  [3].  Policies  in  SAMUEL  are  specified  by  a  set  of  rules  that 
are  used  to  select  controls  based  on  observations.  The  genetic  algorithm  in  SAMUEL  adds,  deletes, 
and  changes  those  rules  in  an  attempt  to  improve  the  value  of  the  reward  function.  Application  of 
SAMUEL  to  the  ship  defense  problem  would  require  the  creation  of  a  world  model  that  is  used  by 
the  simulation  module  in  SAMUEL  to  evaluate  candidate  policy  specifications. 

The  effectiveness  of  the  technique  used  by  SAMUEL  depends  on  the  suitability  of  a  relatively 
small  rule  set  for  specification  of  near-optimal  policies.  Because  we  expect  to  be  able  to  achieve  an 
optimal  dynamic  programming  implementation  for  our  limited  engagement  model  described  above, 
evaluation of SAMUEL's performance on that model should be straightforward. Furthermore, we
would  hope  to  gain  some  insight  into  the  usefulness  of  policy  specification  using  small  rule  sets. 
Although  the  policy  array  is  generally  far  too  large  for  online  storage,  small  rule  sets  would  be 
useful  in  developing  online  decision  aids  and  tactical  guidance  for  defensive  system  operators. 

Another  technique  worth  investigating  is  a  limited  lookahead  forward  tree  search.  Rather  than 
evaluate  every  path  through  the  complete  state  space,  the  number  of  time  steps  to  be  considered 
is  sharply  limited  and  a  heuristic  evaluation  of  the  partial  reward  is  performed  for  each  state  when 
that  limit  is  reached.  By  limiting  the  depth  of  the  tree  search  we  can  avoid  both  consideration 
of  unreachable  states  and  extensive  reconsideration  of  the  same  state.  This  approach  is  similar  to 
the  technique  used  by  computer  chess  programs;  however,  the  introduction  of  stochastic  transitions 
could  result  in  some  significant  differences  in  the  implementation  of  the  concept. 
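
A depth-limited search over stochastic transitions can be sketched as follows. This is an illustrative expectimax-style recursion, not a method specified in this report: the `Model` container, the toy engagement (a single threat a few steps away, an assumed 0.7 intercept probability and an assumed 0.2 survival probability against a leaker), and the optimistic heuristic are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Model:
    controls: Callable        # state -> list of admissible controls
    transitions: Callable     # (state, control) -> [(probability, next_state)]
    reward_factor: Callable   # state -> nonnegative multiplicative factor
    heuristic: Callable       # state -> estimate of the remaining reward


def expected_next(model: Model, state, control, depth: int) -> float:
    """Expectation of the lookahead value over the stochastic successors."""
    return sum(p * lookahead_value(model, nxt, depth - 1)
               for p, nxt in model.transitions(state, control))


def lookahead_value(model: Model, state, depth: int) -> float:
    options = model.controls(state)
    if not options:                       # terminal state: exact factor
        return model.reward_factor(state)
    if depth == 0:                        # search limit: heuristic estimate
        return model.heuristic(state)
    return model.reward_factor(state) * max(
        expected_next(model, state, u, depth) for u in options)


def choose_control(model: Model, state, depth: int):
    return max(model.controls(state),
               key=lambda u: expected_next(model, state, u, depth))


# Hypothetical toy engagement: state is (steps_to_impact, sams_left) or SAFE.
SAFE = "safe"


def toy_controls(state) -> List[bool]:
    if state == SAFE or state[0] == 0:
        return []
    return [False, True] if state[1] > 0 else [False]


def toy_transitions(state, launch: bool) -> List[Tuple[float, object]]:
    k, n = state
    if launch:                            # assumed 0.7 intercept probability
        return [(0.7, SAFE), (0.3, (k - 1, n - 1))]
    return [(1.0, (k - 1, n))]


def toy_reward_factor(state) -> float:
    if state == SAFE:
        return 1.0
    return 0.2 if state[0] == 0 else 1.0  # assumed survival against a leaker


toy = Model(toy_controls, toy_transitions, toy_reward_factor, lambda s: 1.0)
print(choose_control(toy, (2, 2), depth=3), lookahead_value(toy, (2, 2), depth=3))
```

Unlike a deterministic game tree, each control leads to a probability-weighted average over successors, so the branching grows with the fanout as well as with the number of controls.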

Selection of an appropriate heuristic for the static evaluation of terminal states is critical to
the performance of the limited lookahead technique. One possible heuristic would be to use the
same reward that SAMUEL would use to evaluate candidate rule sets. SAMUEL determines the
performance of a set of rules from a known starting state by Monte Carlo simulation. To use
SAMUEL's approach for static evaluation of an intermediate state, SAMUEL would first be used
to find a good set of rules, and then SAMUEL's simulation module would be applied to the state
being considered to find the expected performance of that set of rules starting from the specified
state. An advantage of the limited lookahead technique is that it might be possible to apply
it in real time rather than developing a policy off-line. If rules turn out to be poorly suited for
specifying near-optimal policies, real-time policy development may be the only practical alternative.
The availability of an optimal dynamic programming implementation for the engagement model
would also facilitate performance evaluation in this case.

A third approach that could be useful, if small rule sets are not found to be effective, is to
develop a neural network that could generate near-optimal controls. In one such approach, a
backpropagating neural network could be trained off-line by using an optimal dynamic programming
solution for a limited engagement model and then it could be used in real time to generate the
optimal controls. Werbos has proposed using backpropagation through time to train the world
model of a backpropagating adaptive critic [4].

Finally,  Hopfield  neural  networks  offer  the  possibility  of  directly  developing  an  optimal  policy, 
if  a  short  binary  specification  for  a  policy  could  be  developed.  The  rule  set  developed  by  SAMUEL 
may  provide  some  insight  into  the  development  of  an  appropriate  policy  representation. 

7  CONCLUSION 

Because  value  iteration  dynamic  programming  requires  that  we  work  backward  in  time,  we 
must  consider  every  combination  of  state  variables  at  every  time.  Although  we  may  be  able  to  find 
an  optimal  solution  for  a  small  problem  using  dynamic  programming,  solving  problems  based  on 
larger  models  requires  more  computer  resources  than  can  practically  be  provided.  For  this  reason 
we  are  motivated  to  search  for  faster  techniques.  Both  heuristic  techniques  and  alternative  uses 
of  the  dynamic  programming  equations  bear  further  investigation.  The  value  iteration  dynamic 
programming  solution  we  have  developed  is  useful,  however,  since  it  can  be  used  to  gain  insight 
into the design of such algorithms and to evaluate the performance of those algorithms in a restricted
domain.

Appendix A


DYNAMIC  PROGRAMMING  WITH  MULTIPLICATIVE  REWARDS 


A.1  GIVEN


We are given an initial time 0, a constant time quantum of 1 arbitrary unit, a nonnegative integer
final time $\tau$, random variables $\{X_0, \ldots, X_\tau\}$ taking values $\{x_0, \ldots, x_\tau\}$ that represent states, a class
of control policies $\pi = \{\mu_0(\cdot), \ldots, \mu_{\tau-1}(\cdot)\}$, and a reward function $J_\pi$ with the following properties:

(a)  $x_t \in \mathcal{X}_t\ \forall (t \in \{0, \ldots, \tau\})$, where $\mathcal{X}_t$ is a finite set that represents the possible states at
time t.

(b)  $\mu_t : \mathcal{X}_t \to \mathcal{U}\ \forall (t \in \{0, \ldots, \tau - 1\})$, where $\mathcal{U}$ is a finite set that represents the available
controls. Furthermore, $\mu_t(x_t) \in \mathcal{U}_t(x_t)\ \forall (t \in \{0, \ldots, \tau - 1\}), \forall (x_t \in \mathcal{X}_t)$, where $\mathcal{U}_t(x_t)$ is a
finite set that represents the admissible controls at time t if the state is $x_t$.

(c)  The distribution on $X_0$ is given and the distribution on $X_{t+1}$ is completely determined
by $x_t$ and $\mu_t(x_t)\ \forall t \in \{0, \ldots, \tau - 1\}$:

$$\Pr\{X_{t+1} = x_{t+1} \mid X_0 = x_0, \ldots, X_t = x_t;\ \mu_0(\cdot), \ldots, \mu_{\tau-1}(\cdot)\} = \Pr\{X_{t+1} = x_{t+1} \mid X_t = x_t;\ \mu_t(x_t)\}$$

By, for example, $\Pr\{X_{t+1} = x_{t+1} \mid X_t = x_t;\ \mu_t(x_t)\}$ we mean the probability of the event
$\{X_{t+1} = x_{t+1}\}$ conditioned on the event $\{X_t = x_t\}$, given the value of $\mu_t(x_t)$ as a
parameter.

(d)  At each time $t \in \{0, \ldots, \tau\}$, a nonnegative reward factor $g_t(x_t)$ that depends only on the
present state $x_t$ is computed.

(e)  The overall reward is the expectation of the product of the reward factors (and is a
function of the control policy $\pi$):

$$J_\pi = E_{\{X_0, \ldots, X_\tau\}}\Big[\prod_{t=0}^{\tau} g_t(X_t)\Big]$$

Our goal is to find the maximum reward that can be earned over all admissible policies and a
policy which earns that reward. Formally:

$$J^* = \max_{\pi}\Big\{ E_{\{X_0, \ldots, X_\tau\}}\Big[\prod_{t=0}^{\tau} g_t(X_t)\Big] \Big\}$$

$$\pi^* = \arg\max_{\pi}\Big\{ E_{\{X_0, \ldots, X_\tau\}}\Big[\prod_{t=0}^{\tau} g_t(X_t)\Big] \Big\}$$

Where we maximize over $\pi$ we mean maximization over $\mu_t(x_t) \in \mathcal{U}_t(x_t),\ \forall (t \in \{0, \ldots, \tau - 1\}),\ \forall (x_t \in \mathcal{X}_t)$.

A.2  CLAIM

Property (e) defines a reward function with a multiplicative structure. For reward functions
with an additive cost structure, the dynamic programming algorithm described by Bertsekas
in [5] provides a solution technique with a time complexity that is $O(\tau)$. Here we describe an
analogous algorithm for a multiplicative reward function.

We define a set of functions $V_t(x_t)$ as:

$$V_\tau(x_\tau) = g_\tau(x_\tau)$$

$$V_t(x_t) = g_t(x_t) \cdot \max_{\mu_t(x_t)}\Big\{ E_{\{X_{t+1}\}}\big[V_{t+1}(X_{t+1})\big] \Big\}, \quad (t \in \{0, \ldots, \tau - 1\}),$$

and then claim we can compute $J^*$ as

$$J^* = E_{\{X_0\}}\big[V_0(X_0)\big].$$

Furthermore, we define:

$$\pi_\tau = \emptyset$$

$$\mu_t^*(x_t) = \arg\max_{\mu_t(x_t)}\Big\{ E_{\{X_{t+1}\}}\big[V_{t+1}(X_{t+1})\big] \Big\}, \quad (t \in \{0, \ldots, \tau - 1\})$$

$$\pi_t = \{\mu_t^*(\cdot)\} \cup \pi_{t+1}, \quad (t \in \{0, \ldots, \tau - 1\}),$$

where $\emptyset$ represents the empty set, and claim that

$$\pi^* = \pi_0.$$


A.3  PROOF 


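We begin with the special case of $\tau = 0$. Here the result follows directly from the definitions:

$$J^* = \max_{\pi}\Big\{ E_{\{X_0\}}\Big[\prod_{t=0}^{0} g_t(X_t)\Big] \Big\} = E_{\{X_0\}}\big[g_0(X_0)\big] = E_{\{X_0\}}\big[V_0(X_0)\big].$$

Thus we may focus our attention on the case in which $\tau \geq 1$. We will show by induction on n
that $\forall n \in \{0, \ldots, \tau - 1\}$

$$E_{\{X_0\}}\big[V_0(X_0)\big] = \max_{\mu_0(\cdot)} \cdots \max_{\mu_n(\cdot)}\Big\{ E_{\{X_0, \ldots, X_{n+1}\}}\Big[V_{n+1}(X_{n+1}) \cdot \prod_{t=0}^{n} g_t(X_t)\Big] \Big\}.$$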


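For the basis case we will take n = 0 and show that

$$E_{\{X_0\}}\big[V_0(X_0)\big] = \max_{\mu_0(\cdot)}\Big\{ E_{\{X_0, X_1\}}\Big[V_1(X_1) \cdot \prod_{t=0}^{0} g_t(X_t)\Big] \Big\} = \max_{\mu_0(\cdot)}\Big\{ E_{\{X_0, X_1\}}\big[V_1(X_1) \cdot g_0(X_0)\big] \Big\}.$$

Expanding the left side by using the definitions of $V_0$ and expectation, bringing two nonnegative
constants inside a maximization and a summation, and then interchanging maximization and
summation:

$$\begin{aligned}
E_{\{X_0\}}\big[V_0(X_0)\big]
&= E_{\{X_0\}}\Big[g_0(X_0) \cdot \max_{\mu_0(X_0)}\Big\{ E_{\{X_1\}}\big[V_1(X_1)\big] \Big\}\Big] \\
&= \sum_{x_0 \in \mathcal{X}_0} \Pr\{X_0 = x_0\} \cdot g_0(x_0) \cdot \max_{\mu_0(x_0)}\Big\{ \sum_{x_1 \in \mathcal{X}_1} \Pr\{X_1 = x_1 \mid X_0 = x_0;\ \mu_0(x_0)\} \cdot V_1(x_1) \Big\} \\
&= \sum_{x_0 \in \mathcal{X}_0} \max_{\mu_0(x_0)}\Big\{ \sum_{x_1 \in \mathcal{X}_1} \Pr\{X_1 = x_1 \mid X_0 = x_0;\ \mu_0(x_0)\} \cdot \Pr\{X_0 = x_0\} \cdot V_1(x_1) \cdot g_0(x_0) \Big\} \\
&= \max_{\mu_0(\cdot)}\Big\{ \sum_{x_0 \in \mathcal{X}_0} \sum_{x_1 \in \mathcal{X}_1} \Pr\{X_1 = x_1 \mid X_0 = x_0;\ \mu_0(x_0)\} \cdot \Pr\{X_0 = x_0\} \cdot V_1(x_1) \cdot g_0(x_0) \Big\}.
\end{aligned}$$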


This last step requires some justification. With the maximization inside the summation we are
separately choosing the values of $\mu_0(x_0)$ that maximize each term of the summation. Maximizing
outside the summation requires us to simultaneously choose values of $\mu_0(\cdot)$ for every possible value
of $x_0$ that maximize the sum. In this case the two operations are equivalent because each choice
of $\mu_0(x_0)$ affects at most one term of the sum. In particular, property (c) ensures that
$\Pr\{X_1 = x_1 \mid X_0 = x_0;\ \mu_0(x_0)\}$ depends on no other choice of $\mu_0(\cdot)$. Also note that $V_1(x_1)$ does not vary with
$\mu_0(x_0)$ because no information for earlier times is used in the construction of $V_1$. In general, we can
exchange maximization and summation in this way whenever the index of the outer summation
assumes values from the domain of the function over which we are maximizing, each term in the
summation depends on the value of the function over which we are maximizing for at most one
element of its domain, and the value of the function for each element in its domain affects at most
one term in the summation.
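
The exchange of maximization and summation can be checked numerically on a small case. The sketch below is an illustration with made-up weights, not part of the proof: `term(x, u)` stands for one summand that depends on the control function only through its value u at the point x, which is exactly the condition stated above.

```python
from itertools import product
from typing import Callable, Dict, Hashable, List


def max_outside(domain: List[Hashable], choices: List[bool],
                term: Callable[[Hashable, bool], float]) -> float:
    """Maximize the whole sum over every function mu: domain -> choices."""
    best = float("-inf")
    for assignment in product(choices, repeat=len(domain)):
        mu: Dict[Hashable, bool] = dict(zip(domain, assignment))
        best = max(best, sum(term(x, mu[x]) for x in domain))
    return best


def max_inside(domain: List[Hashable], choices: List[bool],
               term: Callable[[Hashable, bool], float]) -> float:
    """Maximize each summand separately, then add the maxima."""
    return sum(max(term(x, u) for u in choices) for x in domain)


# Hypothetical nonnegative summands, indexed by state and control.
weights = {
    0: {False: 0.2, True: 0.5},
    1: {False: 0.4, True: 0.1},
    2: {False: 0.3, True: 0.3},
}


def toy_term(x: int, u: bool) -> float:
    return weights[x][u]


outside = max_outside([0, 1, 2], [False, True], toy_term)
inside = max_inside([0, 1, 2], [False, True], toy_term)
print(outside, inside)
```

The brute-force version enumerates every control function, so its cost grows exponentially with the size of the domain, while the termwise version grows only linearly. That gap is what the dynamic programming recursion exploits.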

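To complete the proof of the basis case we apply the definitions of conditional probability and
expectation:

$$\begin{aligned}
E_{\{X_0\}}\big[V_0(X_0)\big]
&= \max_{\mu_0(\cdot)}\Big\{ \sum_{x_0 \in \mathcal{X}_0} \sum_{x_1 \in \mathcal{X}_1} \Pr\{X_1 = x_1, X_0 = x_0;\ \mu_0(x_0)\} \cdot V_1(x_1) \cdot g_0(x_0) \Big\} \\
&= \max_{\mu_0(\cdot)}\Big\{ E_{\{X_0, X_1\}}\big[V_1(X_1)\, g_0(X_0)\big] \Big\}.
\end{aligned}$$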


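We now take as our inductive hypothesis for $n \in \{1, \ldots, \tau - 1\}$

$$E_{\{X_0\}}\big[V_0(X_0)\big] = \max_{\mu_0(\cdot)} \cdots \max_{\mu_{n-1}(\cdot)}\Big\{ E_{\{X_0, \ldots, X_n\}}\Big[V_n(X_n) \cdot \prod_{t=0}^{n-1} g_t(X_t)\Big] \Big\}$$

and show that

$$E_{\{X_0\}}\big[V_0(X_0)\big] = \max_{\mu_0(\cdot)} \cdots \max_{\mu_n(\cdot)}\Big\{ E_{\{X_0, \ldots, X_{n+1}\}}\Big[V_{n+1}(X_{n+1}) \cdot \prod_{t=0}^{n} g_t(X_t)\Big] \Big\}.$$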


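We begin by separating the expectation on the right side of the inductive hypothesis, bringing
a constant outside the inner expectation, expanding $V_n$, and bringing two nonnegative constants
inside the inner maximization:

$$\begin{aligned}
E_{\{X_0\}}\big[V_0(X_0)\big]
&= \max_{\mu_0(\cdot)} \cdots \max_{\mu_{n-1}(\cdot)}\Big\{ E_{\{X_0, \ldots, X_n\}}\Big[V_n(X_n) \cdot \prod_{t=0}^{n-1} g_t(X_t)\Big] \Big\} \\
&= \max_{\mu_0(\cdot)} \cdots \max_{\mu_{n-1}(\cdot)}\Big\{ E_{\{X_n\}}\Big[ E_{\{X_0, \ldots, X_{n-1}\}}\Big[V_n(X_n) \cdot \prod_{t=0}^{n-1} g_t(X_t)\Big] \Big] \Big\} \\
&= \max_{\mu_0(\cdot)} \cdots \max_{\mu_{n-1}(\cdot)}\Big\{ E_{\{X_n\}}\Big[ V_n(X_n) \cdot E_{\{X_0, \ldots, X_{n-1}\}}\Big[\prod_{t=0}^{n-1} g_t(X_t)\Big] \Big] \Big\} \\
&= \max_{\mu_0(\cdot)} \cdots \max_{\mu_{n-1}(\cdot)}\Big\{ E_{\{X_n\}}\Big[ g_n(X_n) \cdot \max_{\mu_n(X_n)}\Big\{ E_{\{X_{n+1}\}}\big[V_{n+1}(X_{n+1})\big] \Big\} \cdot E_{\{X_0, \ldots, X_{n-1}\}}\Big[\prod_{t=0}^{n-1} g_t(X_t)\Big] \Big] \Big\} \\
&= \max_{\mu_0(\cdot)} \cdots \max_{\mu_{n-1}(\cdot)}\Big\{ E_{\{X_n\}}\Big[ \max_{\mu_n(X_n)}\Big\{ E_{\{X_{n+1}\}}\big[V_{n+1}(X_{n+1})\big] \cdot g_n(X_n) \cdot E_{\{X_0, \ldots, X_{n-1}\}}\Big[\prod_{t=0}^{n-1} g_t(X_t)\Big] \Big\} \Big] \Big\}.
\end{aligned}$$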


Now observe that the expectation over $X_n$ is taken on an expression that depends on only one
value of $\mu_n(X_n)$. Viewing the expectation as a weighted sum, the index of the summation assumes
values from the domain of the function over which we are maximizing, each term in the summation
depends on the value of the function over which we are maximizing for at most one element of its
domain, and the value of the function for each element in its domain affects at most one term in
the summation. So we can exchange that expectation with the inner maximization. Doing this and
moving constants inside expectations:

$$\begin{aligned}
E_{\{X_0\}}\big[V_0(X_0)\big]
&= \max_{\mu_0(\cdot)} \cdots \max_{\mu_n(\cdot)}\Big\{ E_{\{X_n\}}\Big[ E_{\{X_{n+1}\}}\big[V_{n+1}(X_{n+1})\big] \cdot g_n(X_n) \cdot E_{\{X_0, \ldots, X_{n-1}\}}\Big[\prod_{t=0}^{n-1} g_t(X_t)\Big] \Big] \Big\} \\
&= \max_{\mu_0(\cdot)} \cdots \max_{\mu_n(\cdot)}\Big\{ E_{\{X_n\}}\Big[ E_{\{X_{n+1}\}}\Big[ V_{n+1}(X_{n+1}) \cdot g_n(X_n) \cdot E_{\{X_0, \ldots, X_{n-1}\}}\Big[\prod_{t=0}^{n-1} g_t(X_t)\Big] \Big] \Big] \Big\} \\
&= \max_{\mu_0(\cdot)} \cdots \max_{\mu_n(\cdot)}\Big\{ E_{\{X_n\}}\Big[ E_{\{X_{n+1}\}}\Big[ E_{\{X_0, \ldots, X_{n-1}\}}\Big[ V_{n+1}(X_{n+1}) \cdot g_n(X_n) \cdot \prod_{t=0}^{n-1} g_t(X_t) \Big] \Big] \Big] \Big\}.
\end{aligned}$$


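Combining the expectations and the product of the reward factors completes the induction:

$$E_{\{X_0\}}\big[V_0(X_0)\big] = \max_{\mu_0(\cdot)} \cdots \max_{\mu_n(\cdot)}\Big\{ E_{\{X_0, \ldots, X_{n+1}\}}\Big[V_{n+1}(X_{n+1}) \cdot \prod_{t=0}^{n} g_t(X_t)\Big] \Big\}.$$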


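Choosing $n = \tau - 1$ and applying the definitions of $V_\tau$ and $\pi$:

$$\begin{aligned}
E_{\{X_0\}}\big[V_0(X_0)\big]
&= \max_{\mu_0(\cdot)} \cdots \max_{\mu_{\tau-1}(\cdot)}\Big\{ E_{\{X_0, \ldots, X_\tau\}}\Big[V_\tau(X_\tau) \cdot \prod_{t=0}^{\tau-1} g_t(X_t)\Big] \Big\} \\
&= \max_{\pi}\Big\{ E_{\{X_0, \ldots, X_\tau\}}\Big[g_\tau(X_\tau) \cdot \prod_{t=0}^{\tau-1} g_t(X_t)\Big] \Big\} \\
&= \max_{\pi}\Big\{ E_{\{X_0, \ldots, X_\tau\}}\Big[\prod_{t=0}^{\tau} g_t(X_t)\Big] \Big\} \\
&= J^*.
\end{aligned}$$

This proves the first claim. The proof of the second claim follows the same approach with more
attention to the identity of the maximizing control functions.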



Appendix  B 


EQUIVALENCE  OF  THE  SIMPLIFIED  POLICY 


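We claim that maximizing $\hat{J}_{\hat{\pi}}$ is equivalent to maximizing $J_\pi$ in the sense that for any optimal
policy $\pi^*$ there is an optimal policy $\hat{\pi}^*$ such that

$$\hat{J}_{\hat{\pi}^*} = J_{\pi^*}.$$

To prove this, it suffices to show that $\hat{V}_0(O_0) = V_0(I_0)\ \forall O_0\ \forall (I_0 \ni O_0 = T(0, I_0))$. To accomplish this, we
will prove by induction backwards on t from $\tau$ to 0 by $\Delta$ that

$$\hat{V}_t(O_t) = V_t(I_t) \quad \forall t \in \{0, \ldots, \tau\}\ \forall O_t\ \forall (I_t \ni O_t = T(t, I_t)).$$

For the basis case, we begin with $\hat{V}_\tau(O_\tau)$ and apply Eqs. (26), (20), (19), (16) and (17) to get

$$\hat{V}_\tau(O_\tau) = \prod_{i \ni A_i(\tau) = 0} (1 - T_i(\tau)) = V_\tau(I_\tau)$$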

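which proves the basis case. For the inductive step we assume

$$\hat{V}_{t+\Delta}(O_{t+\Delta}) = V_{t+\Delta}(I_{t+\Delta}) \quad \forall O_{t+\Delta}\ \forall (I_{t+\Delta} \ni O_{t+\Delta} = T(t + \Delta, I_{t+\Delta})),\ t \in \{\Delta, \ldots, \tau - \Delta\}$$

and must show that

$$\hat{V}_t(O_t) = V_t(I_t) \quad \forall O_t\ \forall (I_t \ni O_t = T(t, I_t)).$$

Let $O_t$ be arbitrary. Then, by Eqs. (18), (16), (19), (20) and the inductive hypothesis

$$\begin{aligned}
V_t(I_t) &= \max_{\mu_t(I_t)}\Big\{ E_{\{I_{t+\Delta}\}}\Big[ V_{t+\Delta}(I_{t+\Delta}) \cdot \prod_{i \ni A_i(t) = 0} (1 - T_i(t)) \Big] \Big\} \\
&= \max_{\mu_t(I_t)}\Big\{ E_{\{I_{t+\Delta}\}}\Big[ \hat{V}_{t+\Delta}(O_{t+\Delta}) \cdot \prod_{i \ni A_i(t) = 0} (1 - T_i(t)) \Big] \Big\}.
\end{aligned}$$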

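Observing that the function inside the expectation depends on no aspects of $I_{t+\Delta}$ other than
$O_{t+\Delta}$ and that $T(t + \Delta, \cdot)$ induces a partition on $I_{t+\Delta}$, we can now write:

$$V_t(I_t) = \max_{\mu_t(I_t)}\Big\{ E_{\{O_{t+\Delta}\}}\Big[ \hat{V}_{t+\Delta}(O_{t+\Delta}) \cdot \prod_{i \ni A_i(t) = 0} (1 - T_i(t)) \Big] \Big\}$$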

When we write $\mu_t(I_t)$, we mean the control $L(t)$ that is selected at time t. Together, $O_t$ and
$L(t)$ determine the distribution on $O_{t+\Delta}$. Observe, however, that the value of the expectation
is a function of the value of $O_t$ and the distribution on $O_{t+\Delta}$. So, while different controls may
maximize the expectation for the same $O_t$, all must result in the same maximal value. Thus,
any maximizing control may be chosen arbitrarily. Therefore we apply Eqs. (25) and (27) to
write

$$V_t(I_t) = \max_{\hat{\mu}_t(O_t)}\Big\{ E_{\{O_{t+\Delta}\}}\Big[ \hat{V}_{t+\Delta}(O_{t+\Delta}) \cdot \prod_{i \ni A_i(t) = 0} (1 - T_i(t)) \Big] \Big\} = \hat{V}_t(O_t)$$

as was to be shown.


